Preface



Embedded computer systems are now everywhere: from
alarm clocks to PDAs, from mobile phones to cars, almost all the 
devices we use are controlled by embedded computers. An 
important class of embedded computer systems is that of hard 
real-time systems, which have to fulfill strict timing require- 
ments. As real-time systems become more complex, they are 
often implemented using distributed heterogeneous architec- 
tures. 

This book presents analysis and synthesis methods for heterogeneous,
distributed hard real-time embedded systems. Such
systems are used in many application areas like automotive 
electronics, real-time multimedia, avionics, medical equipment, 
and factory systems. These systems are heterogeneous not only 
in terms of platforms and communication protocols, but also in 
terms of scheduling approaches. 

The proposed analysis and synthesis techniques derive optimized
implementations that fulfill the imposed design constraints.
An important part of the implementation process is the
synthesis of the communication infrastructure, which has a
significant impact on the overall system performance and cost.

To reduce the time-to-market of products, the design of real- 
time systems seldom starts from scratch. Typically, designers 
start from an already existing system, running certain applica- 
tions, and the design problem is to implement new functionality 
on top of this system. Hence, in addition to the analysis and syn- 
thesis methods proposed, we also consider mapping and sched- 
uling within such an incremental design process. Supporting 
incremental design provides a high degree of flexibility, and can 
result in important reductions of design costs. 

The book is structured in four parts, and has ten chapters.

In the first part we present the methodologies commonly used 
in embedded systems design, and we introduce the application 
model considered, a control and dataflow graph based represen- 
tation, called conditional process graph. We also present the 
hardware and software architectures considered, including the 
details of the communication protocols used: the time triggered 
protocol (TTP), which is a time-driven protocol based on a time- 
division multiple access bus access scheme, and the controller 
area network protocol (CAN), an event-driven communication 
protocol employing a collision avoidance scheme.

The second part presents analysis and synthesis methods for 
time-driven systems, where processes are activated according to 
a time-triggered policy. Messages are transmitted using the TTP, 
while the scheduling of processes is performed using static cyclic 
scheduling. In such a context, we discuss the static cyclic sched- 
uling of systems with data and control dependencies, and inves- 
tigate the mapping and scheduling tasks in the context of an 
incremental design approach. 

In the third part we provide an analysis and develop techniques
for the synthesis of event-driven systems, where the activation
of processes is done at the occurrence of significant
events. The scheduling of processes is performed using fixed-priority
preemptive scheduling. Although a natural mapping of
event-driven messages would be on a CAN bus, combining preemptive
priority-based scheduling at the process level with time-triggered
static scheduling at the communication level can be
the right solution under several circumstances. Therefore, in
part three we consider that messages produced by event-triggered
processes are transmitted using the TTP, and we
develop four message passing policies for transmitting event-triggered
messages over a time-triggered bus. Optimization
strategies that derive the parameters of the communication
protocol are also presented.

The fourth part combines time-driven and event-driven systems
into heterogeneous networks, called multi-cluster systems.
A multi-cluster system consists of several clusters, interconnected
by gateways. A cluster is composed of nodes which share a
broadcast communication channel. In a time-triggered cluster, 
processes and messages are scheduled according to a static 
cyclic policy, with the bus implementing the TTP. On an event- 
triggered cluster, the processes are scheduled according to a pri- 
ority based preemptive approach, while messages are transmit- 
ted using the priority-based CAN protocol. We propose a 
schedulability analysis technique for such systems, together 
with optimization strategies for synthesizing a communication 
infrastructure that guarantees the timing constraints of the 
application. We also address design problems which are charac- 
teristic to multi-clusters: partitioning of the system functional- 
ity into time-triggered and event-triggered domains, the 
mapping of processes to the heterogeneous nodes of a cluster, 
and the packing of application messages to frames in the case of 
TTP and CAN protocols. 

The approaches presented in the book have been evaluated 
using a real-life case study consisting of a vehicle cruise controller,
and an extensive set of synthetic applications generated for
evaluation purposes. 

The authors are grateful to Olof Bridal, Magnus Hellring, and 
Henrik Lönn at Volvo Technology Corporation and Jakob Axelsson
at Volvo Car Corporation for their close involvement and
valuable feedback during this work. We would like to thank Rolf
Ernst from Technical University of Braunschweig, Germany,
Hans Hansson from Mälardalen University, Sweden, Axel
Jantsch from Royal Institute of Technology, Sweden, and Simin
Nadjm-Tehrani from Linköping University, Sweden, for reading
an early draft of this book. Many thanks also to Traian Pop and 
Viacheslav Izosimov for their help with the implementation of 
some of the algorithms used in the book. 







PART I 

Preliminaries 




Chapter 1 
Introduction 



The first modern computers occupied entire rooms, had 
thousands of vacuum tubes, and dissipated hundreds of kilowatts
of heat, but could only execute a couple of thousand
simple instructions per second [EB03a]. Today, a complex microprocessor,
which dwarfs the performance of the first electronic
computers, can be integrated into a digital wristwatch. 

This extraordinary development is due to the microelectronics 
revolution that, according to Moore’s law, allows us to double the 
number of transistors integrated on a single chip every 18 
months, from 2,300 in the first Intel 4004 chip to 42 million in 
the latest Pentium 4 microprocessor [EB03b].

Not only have digital systems become more powerful, and 
integrated an increasing number of transistors, but their cost 
has also dropped dramatically. This has led to a situation where 
we have a huge amount of very cheap computation power avail- 
able on a very small physical size, allowing the digital systems 
to be present in every aspect of our daily lives. 

Nowadays, digital systems are everywhere. We are surrounded
by desktop computers, mobile phones, personal digital
assistants (PDAs), pagers, scanners, DVD players, VCRs, video
game consoles, fax machines, digital cameras, home security
systems, electronic toys, card readers, ATMs, cars, trains, air- 
planes, etc., all of which contain in some form or another a digi- 
tal system. 

Whenever the digital systems augment or control a function of 
a host object or system, we say that these digital systems are 
embedded into the host system, hence the term embedded sys- 
tem. Out of the digital systems mentioned above, only the desk- 
top computer is not an embedded system. The desktop computer 
is a general-purpose system that can be programmed to implement
virtually any type of function. In contrast, embedded systems
are not general-purpose systems; rather, their functionality
is dedicated to a limited set of functions required by the
host system.

Recently, the number of embedded systems in use has become 
larger than the number of humans on the planet [Aar03]. That is 
not difficult to believe, considering that more than 99% of the 
microprocessors produced today are used in embedded systems 
[Tur99] and the number of microprocessors manufactured has 
been increasing rapidly over many years. Although the number 
of embedded systems and their diversity is huge, they share a 
small, but significant, set of common characteristics: 

• Embedded systems are designed to perform a dedicated set 
of functions. 

For example, a digital camera is designed principally for 
taking pictures; it cannot be programmed by the user, for 
example, to solve differential equations. 

• Embedded systems have to perform under very tight, varied, 
and competing constraints. 

A wearable blood pressure monitor has to be small 
enough to be placed on the wrist, a mobile phone has to con- 
sume very little power so the battery lasts for a couple of 
weeks, and in-vehicle electronics have to perform under tight 
timing constraints. 

• In addition to all these, embedded systems have to be cheap
to produce and maintain, and, at the same time, flexible
enough to be extended with new dedicated functions whenever
necessary.

Therefore, in order to function correctly, an embedded system 
not only has to be designed such that it implements the required 
functionality, but also has to fulfill a wide range of competing 
constraints: development cost, unit cost, size, performance, 
power consumption, flexibility, time-to-prototype, time-to-market,
maintainability, correctness, safety, etc. [Vah02].

As the design of such systems is becoming increasingly com- 
plex, new analysis and synthesis techniques are needed. This 
book presents several analysis and synthesis methods for dis- 
tributed real-time embedded systems that have recently been 
developed. 

1.1 A Typical Application Area:
Automotive Electronics

Although the techniques presented in this book can be success- 
fully used in several application areas, it is useful, for under- 
standing the embedded systems evolution and design 
challenges, to exemplify the developments in a particular area. 

Automotive manufacturers, for example, were reluctant until
recently to use computer-controlled functions onboard vehicles.
Today, this attitude has changed for several reasons. First,
there is a constant market demand for increased vehicle
performance, more functionality, lower fuel consumption, and
fewer exhaust emissions, all at lower cost. Second, on the
manufacturers' side, there is a need for shorter time-to-market
and reduced development and manufacturing costs.
These, combined with the advancements of semiconductor technology,
which is delivering ever increasing performance at lower
and lower costs, have led to the rapid increase in the number of
electronically controlled functions onboard a vehicle [Kop99].

The amount of electronic content in an average car in 1977 
had a cost of $110. Currently, that cost is $1341, and it is 
expected that this figure will reach $1476 by the year 2005, con- 
tinuing to increase because of the introduction of sophisticated 
electronics found until now only in high-end cars (see 
Figure 1.1) [Han02], [Lee02]. It is estimated that in 2006 the
electronics inside a car will amount to 25% of the total cost of the
vehicle (35% for the high end models), a quarter of which will be
due to semiconductors [Jos01], [Han02]. High-end vehicles currently
have up to 100 microprocessors implementing and controlling
various parts of their functionality. The total market for
semiconductors in vehicles is predicted to grow from $8.9 billion
in 1998 to $21 billion in 2005, amounting to 10% of the
total worldwide semiconductor market [Han02], [Kop99].

Along with the increased complexity, the type of
functions implemented by embedded automotive electronics systems
has also evolved. Thanks to the semiconductor revolution,
in the late 50s, electronic devices became small enough to be
installed on board vehicles. In the 60s the first analog fuel
injection system appeared, and in the 70s analog devices for con- 
trolling transmission, carburetor, and spark advance timing 
were developed. The oil crisis of the 70s led to the demand of 
engine control devices that improved the efficiency of the engine, 
thus reducing fuel consumption. In this context, the first micro- 
processor based injection control system appeared in 1976 in the 
USA. During the 80s, more sophisticated systems began to 
appear, like electronically controlled braking systems, dash- 
boards, information and navigation systems, air conditioning 
systems, etc. In the 90s, development and improvement have
concentrated on areas like safety and convenience. Today, it
is not uncommon to have highly critical functions like steering
or braking implemented through electronic functionality only,
without any mechanical backup, as is the case in drive-by-wire
and brake-by-wire systems [Chi96], [XbW98].

A large class of systems have tight performance and reliability 
constraints. A good example is the engine control unit, whose 
main task is to reduce the level of exhausts and the fuel con- 
sumption by controlling the air and fuel mixture in each cylin- 
der. For this, the engine controller is usually designed as a 
closed-loop control system which has as feedback the level of 
exhausts. The engine speed is the most important factor to con- 
sider with respect to the timing requirements of the engine con- 
troller. A typical 4 cylinder engine has an optimal speed of 6,000 
revolutions per minute (RPM). At 6,000 RPM the air to fuel ratio 
for each cylinder must be recomputed every 20 milliseconds 
(ms). This means that in a 4 cylinder engine a single controller of
such type must complete the entire loop in 5 ms! For such an
engine controller, not meeting the timing constraint leads to
less efficient fuel consumption and more exhausts [Chi96]. However,
for other types of systems, like drive-by-wire or brake-by-wire,
not fulfilling the timing requirements can have catastrophic
consequences.

Figure 1.1: Worldwide Automotive Electronics Trends [Han02]
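
The 20 ms and 5 ms figures quoted for the engine controller can be checked with a few lines of arithmetic. The sketch below is only an illustration. It assumes a four-stroke engine, in which each cylinder completes one cycle every two crankshaft revolutions; this assumption is not stated in the text. The function name control_loop_budget_ms is hypothetical.

```python
from __future__ import annotations


def control_loop_budget_ms(rpm: float, cylinders: int,
                           revs_per_cycle: int = 2) -> tuple[float, float]:
    """Return (time per cylinder cycle, time per control loop) in ms.

    revs_per_cycle=2 models a four-stroke engine (an assumption, see above).
    """
    ms_per_revolution = 60_000.0 / rpm            # one minute = 60,000 ms
    per_cylinder = ms_per_revolution * revs_per_cycle
    # A single controller serving all cylinders must finish one loop
    # per cylinder within each engine cycle.
    per_loop = per_cylinder / cylinders
    return per_cylinder, per_loop


per_cylinder, per_loop = control_loop_budget_ms(rpm=6000, cylinders=4)
print(per_cylinder, per_loop)  # 20.0 5.0
```

Under these assumptions the time available to the controller shrinks in proportion to both the engine speed and the number of cylinders it serves.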

We have seen so far that the use of electronics in modern vehi- 
cles is increasing, replacing or augmenting critical mechanical 
and hydraulic vehicle components. Their complexity is growing 
at a very high pace, and the constraints — in terms of functional- 
ity, performance, reliability, cost and time-to-market — are get- 
ting tighter. Therefore, the task of designing such systems is 
becoming increasingly important and difficult at the same time. 
New design techniques are needed, which are able to: 

• successfully manage the complexity of embedded systems, 

• meet the constraints imposed by the application domain, 

• shorten the time-to-market, and 

• reduce development and manufacturing costs. 



1.2 Distributed Hard Real-Time 
Embedded Systems 

In this book, we are interested in a particular class of systems 
called real-time embedded systems. Very important for the cor- 
rect functioning of such systems are their timing constraints. 
For example, a vehicle cruise controller has to react within tens 
of milliseconds to events originating from the driver or road con- 
ditions. Kopetz [Kop97a] gives a definition for a real-time sys- 
tem as being “a computer system in which the correctness of the 
system behavior depends not only on the logical results of the 
computations, but also on the physical instant at which these 
results are produced.” 

Figure 1.2: A Distributed Real-Time System Example



Real-time systems have been classified as hard real-time and 
soft real-time systems [Kop97a]. Basically, hard real-time sys- 
tems are systems where failing to meet a timing constraint can 
potentially have catastrophic consequences. For example, a 
brake-by-wire system in a car failing to react within a given time 
interval can result in a fatal accident. On the other hand, a multimedia
system, which is a soft real-time system, can, under certain
circumstances, tolerate a certain amount of delay, resulting
perhaps in a patchier picture, without serious consequences
besides some possible inconvenience to the user.

The techniques presented in this book are aimed at
hard real-time systems that implement safety-critical applications
where timing constraints are of utmost importance to the
correct behavior of the application.

Many such applications, following physical, modularity or 
safety constraints, are implemented using distributed architec- 
tures. In Figure 1.2 we illustrate a distributed real-time system 
implementing some electronic functions of a vehicle. For example,
the network on the left is responsible for implementing functionality
related to the engine, while the network on the right
implements functions related to the powertrain, like brake-by-wire,
anti-lock braking system, etc.

Such systems are composed of several different types of hard- 
ware components (called nodes), interconnected in a network. 
An increasing number of such systems are today implemented 
as heterogeneous distributed systems, which means that they 
are composed of several networks, interconnected with each 
other. Each network has its own communication protocol and 
two such networks communicate via a gateway which is a node 
connected to both of them. This type of architecture is used in
increasing numbers in several application areas: networks on
chip are heterogeneous, we also see them in, for example, factory
systems, and they are very common in vehicles.
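
As an illustration of the architecture just described, the following minimal sketch models networks as sets of nodes and identifies a gateway as a node that belongs to two networks. All names (Network, gateways, needs_gateway, the node labels N1 to N4 and NG) and the protocol labels attached to each network are hypothetical and chosen only for the example; they are not taken from Figure 1.2.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Network:
    """One network: a set of nodes sharing a communication channel."""
    name: str
    protocol: str                      # illustrative label only
    nodes: set[str] = field(default_factory=set)


def gateways(a: Network, b: Network) -> set[str]:
    """A gateway is a node connected to both networks."""
    return a.nodes & b.nodes


def needs_gateway(src: str, dst: str, networks: list[Network]) -> bool:
    """True when no single network contains both nodes."""
    return not any(src in n.nodes and dst in n.nodes for n in networks)


# Hypothetical two-network system sharing one gateway node "NG".
engine = Network("engine", "TTP", {"N1", "N2", "NG"})
powertrain = Network("powertrain", "CAN", {"N3", "N4", "NG"})
system = [engine, powertrain]

print(gateways(engine, powertrain))        # the shared node NG
print(needs_gateway("N1", "N3", system))   # nodes on different networks
print(needs_gateway("N1", "N2", system))   # nodes on the same network
```

Representing a gateway simply as a shared node keeps the model small; a more detailed model would also describe how the gateway translates between the two protocols.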

For such systems, the communication between the functions
implemented on different nodes has an important impact on the
overall system properties such as performance, cost, maintainability,
etc.

Figure 1.3: Distributed Safety-Critical Applications (the legend
distinguishes the functions of the first, second, and third
application)

The application software running on such distributed archi- 
tectures is composed of several functions. The way the functions 
have been distributed on the architecture has evolved over time. 
Initially, in automotive applications, each function was running 
on a dedicated hardware node, allowing the system integrators 
to purchase nodes implementing required functions from differ- 
ent vendors, and to integrate them together into their system. 
The number of such nodes in the architecture has exploded, 
reaching more than 100 in a high-end car. This has created a 
huge pressure to reduce the number of nodes, use the resources 
available more efficiently, and thus reduce costs. 

This development has led to the need to integrate several 
functions in one node. For this to be possible, middleware soft- 
ware that abstracts away the hardware differences of the nodes 
in the heterogeneous architecture has to be available [Eas02]. 
Using such a middleware architecture, the software functions 
become independent of the particular hardware details of a 
node, and thus they can be distributed on the hardware architec- 
ture, as depicted in Figure 1.3. 

Although an application is typically distributed over one single
network, we begin to see applications that are distributed
across several networks, as is the case in Figure 1.3 where the
third application, represented as black dots, is distributed over
the two networks. This trend is driven by the need to further 
reduce costs, improve resource usage, but also by application 
constraints like having to be physically close to particular sen- 
sors and actuators. Moreover, not only are these applications 
distributed across networks, but their functions can exchange 
critical information through the gateway nodes. 

Such safety-critical hard real-time distributed applications 
running on heterogeneous distributed architectures are inher- 
ently difficult to analyze and implement. Due to their distrib- 
uted nature, the communication has to be carefully considered 
during the analysis and design in order to guarantee that the
timing constraints are satisfied under the competing objective of 
reducing the cost of the implementation. 



1.3 Book Overview 

We are interested in the analysis and synthesis of safety-critical
distributed applications implemented using hard real-time 
embedded systems. 

Several methodologies have been proposed for real-time 
embedded systems design. Regardless of the chosen methodol- 
ogy, there are a number of major design tasks that have to be 
performed. 

One major design task is to decide what components to 
include in the hardware architecture and how these components 
are connected. This is called the architecture selection phase. In 
order to reduce costs, especially in the case of a mass market 
product, the hardware architecture is usually reused, with some 
modifications, for several product lines. Such a common hard- 
ware architecture is denoted by the term hardware platform, 
and consequently the design tasks related to such an approach 
are grouped under the term platform-based design [Keu00].

Once a hardware platform has been fixed, the software func- 
tions have to be specified. For the specification of functionality 
we use a control and dataflow graph based representation
[Ele98a], [Ele00] described in detail in Section 2.3.1.

Next, the designer has to decide what part of the functionality 
should be implemented on which of the selected components (the 
mapping task) and what is the execution order of the resulting 
functions (the scheduling task). An important design task in the 
context of distributed applications is the communication synthe- 
sis task, which decides the characteristics of the communication 
infrastructure and the access constraints to the infrastructure, 
imposed on functions initiating an inter-node communication. 
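
To make the mapping and scheduling tasks more concrete, the sketch below checks a candidate implementation: each process is mapped to a node and given a start time, and the check verifies that processes on the same node do not overlap, that each successor starts after its predecessor finishes, and that everything completes before a deadline. This is a deliberately simplified illustration and not the technique developed in the book: communication delays are ignored, and the names Slot and check_schedule, as well as the process and node labels, are hypothetical.

```python
from __future__ import annotations
from typing import NamedTuple


class Slot(NamedTuple):
    node: str    # result of the mapping task
    start: int   # result of the scheduling task
    wcet: int    # worst-case execution time


def check_schedule(schedule: dict[str, Slot],
                   deps: list[tuple[str, str]],
                   deadline: int) -> list[str]:
    """Return a list of violations; an empty list means the schedule is valid."""
    problems = []
    names = sorted(schedule)
    for i, p in enumerate(names):
        sp = schedule[p]
        if sp.start + sp.wcet > deadline:
            problems.append(f"{p} misses the deadline")
        for q in names[i + 1:]:
            sq = schedule[q]
            # Two processes on the same node must not overlap in time.
            if (sp.node == sq.node and sp.start < sq.start + sq.wcet
                    and sq.start < sp.start + sp.wcet):
                problems.append(f"{p} overlaps {q} on {sp.node}")
    for pred, succ in deps:
        if schedule[pred].start + schedule[pred].wcet > schedule[succ].start:
            problems.append(f"{succ} starts before {pred} finishes")
    return problems


deps = [("P1", "P2"), ("P1", "P3")]
ok = {"P1": Slot("N1", 0, 10), "P2": Slot("N1", 10, 5), "P3": Slot("N2", 12, 8)}
bad = {"P1": Slot("N1", 0, 10), "P2": Slot("N1", 5, 5), "P3": Slot("N2", 12, 8)}

print(check_schedule(ok, deps, deadline=25))   # []
print(check_schedule(bad, deps, deadline=25))  # two violations involving P2
```

A synthesis tool would search for a mapping and start times that pass such a check while also accounting for the time messages spend on the bus.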

These design tasks can partially overlap, and they can be
assisted by analysis and (semi)automatic synthesis tools. In 
addition, the design tasks have to be performed such that the 
timing constraints of hard real-time applications are satisfied, 
and the implementation costs are minimized. 

The analysis takes into account the heterogeneous intercon- 
nected nature of the architecture, and is based on an application 
model which captures both the dataflow and the flow of control. 
The synthesis techniques described here derive implementations
of the system that fulfill the design constraints and reduce
the costs. An important part of the system implementation is the 
synthesis of the communication infrastructure, which has a sig- 
nificant impact on the overall system performance and cost. 

The design of real-time systems seldom starts from scratch. 
Typically, designers start from an already existing system, running
certain applications, and the design problem is to implement
new functionality on this system. Moreover, after the new
functionality has been implemented, the resulting system has to 
be structured such that additional functionality, later to be 
added, can easily be accommodated. 

Such an approach provides a high degree of flexibility during 
the design process, and thus, can result in important reductions 
of design costs. Therefore, the analysis and synthesis methods 
presented have been considered within such an incremental 
design process. 

This book is structured in four parts, and has ten chapters. In 
the first part we present the application model and the hard- 
ware and software architectures considered. The second part 
presents analysis and synthesis methods for time-driven systems,
where the activation of processes and transmission of messages
happen at predetermined points in time. In the third part
we provide an analysis and develop techniques for the synthesis
of event-driven systems, where the activation of processes is
done at the occurrence of significant events. The fourth part
combines time-driven and event-driven systems into heterogeneous
networks, and presents analysis and synthesis methods
for applications distributed across such networks.

This is, briefly, what each chapter is about: 

• Part I: Preliminaries 

— Chapter 2 (System-Level Design and Modeling) presents 
the design methodologies commonly used in embedded 
systems design, with an emphasis on function/architecture
co-design. We introduce the application model considered,
a control and dataflow graph based
representation [Ele98a], [Ele00] called conditional process
graph, as well as a model for characterizing existing
and future applications within an incremental design 
process. 

— Chapter 3 (Distributed Hard Real-Time Systems) pre- 
sents the time-driven and event-driven approaches to the 
design of real-time systems and introduces the non-pre- 
emptive static cyclic scheduling and fixed priority pre- 
emptive scheduling policies. We also present the 
hardware and software architectures considered, includ- 
ing the details of the communication protocols used: the 
time triggered protocol (TTP) [Kop03], which is a time- 
driven protocol based on a time-division multiple access 
(TDMA) bus access scheme, and the controller area network
protocol (CAN) [Bos91], an event-driven communication
protocol employing a collision avoidance scheme.

• Part II: Time-Driven Systems 

— Chapter 4 (Scheduling and Bus Access Optimization for 
Time-Driven Systems) considers a non-preemptive static 
scheduling approach for both processes and messages. In 
such a context, we discuss the static cyclic scheduling of 
systems with data and control dependencies. The pre- 
sented technique is then extended to handle the schedul- 
ing of messages over a communication channel using the 
time-triggered protocol. Several approaches to the synthesis
of communication parameters of a TDMA bus are
proposed.

— Chapter 5 (Incremental Mapping for Time-Driven Sys- 
tems) investigates the mapping and scheduling tasks in 
the context of an incremental design approach. Such a 
design process satisfies two main requirements when 
adding new functionality: the already running applica- 
tions are disturbed as little as possible, and there is a 
good chance that, later, new functionality can easily be 
mapped on the resulting system.

• Part III: Event-Driven Systems

— Chapter 6 (Schedulability Analysis and Bus Access Opti- 
mization for Event-Driven Systems) assumes a preemp- 
tive fixed priority scheduling approach for the processes 
and a non-preemptive static cyclic scheduling approach 
for the messages, based on the TTP. A schedulability anal- 
ysis is developed considering four message scheduling 
approaches for TTP. In addition, we show how, by consid- 
ering both data and control dependencies when modeling 
an application, it is possible to reduce the pessimism of 
the analysis. Several optimization strategies that derive 
the parameters of the communication protocol are also 
presented. 

— Chapter 7 (Incremental Mapping for Event-Driven Sys- 
tems) addresses the same mapping and scheduling prob- 
lems inside an incremental design process as discussed in 
Chapter 5, but this time in the context of the architec- 
tures and event-driven scheduling policies considered in 
Chapter 6. 

• Part IV: Multi-Cluster Systems

— Chapter 8 (Schedulability Analysis and Bus Access Opti- 
mization for Multi-Cluster Systems) introduces the con- 
cept of multi-cluster systems, which are heterogeneous 
networks composed of several networks (called clusters), 
interconnected via gateways. In this chapter we consider
a two-cluster configuration composed of a time-driven 
cluster and an event-driven cluster. We propose a schedu- 
lability analysis technique for such systems, which also 
produces bounds on the communication buffer sizes 
required by an application to fulfill its timing constraints. 
Optimization strategies are developed, aiming at synthesizing
a communication infrastructure that guarantees
the timing constraints of the application while
minimizing the system implementation
costs.

— Chapter 9 (Partitioning and Mapping for Multi-Cluster 
Systems) addresses design problems which are characteristic
of multi-clusters: partitioning of the system functionality
into time-triggered and event-triggered
domains, and mapping of processes to the heterogeneous 
nodes of a cluster. We present several heuristics for solv- 
ing these problems, and show that they are able to find 
schedulable implementations under limited resources, 
achieving an efficient utilization of the system. 

— Chapter 10 (Schedulability-Driven Frame Packing for 
Multi-Cluster Systems) addresses the issue of packing of 
messages to frames in the case of TTP and CAN protocols. 
We update the schedulability analysis presented in
Chapter 8 to take into account the details related to
frames, and we discuss two optimization heuristics
that use the schedulability analysis as a driver towards a 
frame configuration that leads to a schedulable system. 
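
Several of the chapters listed above rely on schedulability analysis for fixed-priority preemptive scheduling. As a rough illustration of what such an analysis computes, the sketch below implements the classic textbook response-time iteration for independent periodic tasks on a single processor, with deadlines equal to periods and no blocking. It is not the analysis developed in this book, which additionally handles communication over the bus and data and control dependencies. The function name response_time and the task parameters are hypothetical.

```python
from __future__ import annotations


def response_time(index: int, tasks: list[tuple[int, int]]) -> int | None:
    """Worst-case response time of tasks[index], or None if it exceeds the period.

    tasks is a list of (wcet, period) pairs sorted by decreasing priority;
    each deadline is assumed equal to the period.
    """
    c_i, t_i = tasks[index]
    r = c_i
    while True:
        # Interference from every higher-priority task released within r.
        interference = sum(-(-r // t_j) * c_j for c_j, t_j in tasks[:index])
        r_next = c_i + interference
        if r_next > t_i:
            return None          # deadline (= period) would be missed
        if r_next == r:
            return r             # fixed point reached
        r = r_next


tasks = [(1, 4), (2, 6), (3, 12)]   # hypothetical task set
print([response_time(i, tasks) for i in range(len(tasks))])  # [1, 3, 10]
```

The iteration starts from the task's own execution time and repeatedly adds the interference of higher-priority tasks until the value stops changing or exceeds the deadline.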

All the approaches presented in the book have been evaluated 
using an extensive set of applications generated for experimen- 
tal purposes. In addition, we also use throughout the book a 
real-life case study, a vehicle cruise controller, in order to illus- 
trate the relevance of the identified problems and to further 
evaluate the proposed approaches. 

The modeling and design of embedded systems can be performed
at several abstraction levels. Gajski [Gaj83] identifies the
following abstraction levels in the context of CAD tools for VLSI: 

• Circuit level is the lowest level of abstraction. For example, 
the hardware at this level is seen as transistors, capacitors, 
resistors, etc., and differential equations are often used to 
describe their functionality. 

• Logic level is next towards higher levels of abstraction. Here, 
the functionality is represented as boolean logic (hence the 
name, logic level), implemented in hardware using logic 
gates and flip-flops. 

• At the register-transfer level the functionality is captured in 
terms of register-transfer operations on ALUs, registers, mul- 
tiplexers, etc. 

• The highest level of abstraction is the system level, where the 
functionality is described using “system-level specification 
formalisms” (in the case of VLSI design, these can be description
languages like VHDL, Verilog or SystemC) and the architecture
is seen as building blocks consisting of processors,
memories, etc., interconnected using buses.

The research presented in this book deals with the design
issues at the system level of abstraction. Providing methodologies,
techniques and tools for system-level design is the only
solution to the increasing complexity of embedded systems, and 
the designer’s productivity gap [Sem02]. The system-level 
design methodology is presented in the next section. In our 
research, we place a special emphasis on an incremental design 
process as outlined in Section 2.2. 

At system level, we view the architectures as a set of heteroge- 
neous interconnected networks, each network consisting of 
nodes sharing the same communication channel. Each node has 
a processor, a memory, a communication controller, and I/Os to
sensors and actuators. Also, the functionality is captured as a 
set of interacting processes, modeled using a process network 
formalism. The exact representation we use for modeling the 
functionality at the system level is described in Section 2.3, 
while the architectures considered are described in more detail 
in Chapter 3. 



2.1 System-Level Design

The aim of a design methodology is to coordinate the design 
tasks such that the time-to-market is minimized, the design con- 
straints are satisfied, and various parameters are optimized. 
System-level design is illustrated in Figure 2.1 (adapted from 
[Ele02]). It emphasizes the design tasks that happen before the 
hardware and software components are definitively identified. 
According to the figure, which groups the system-level design 
tasks and models inside the grey rectangle, system-level design 
tasks take as input the specification of the system and its 
requirements, and produce the hardware and software models, 
which are later synthesized.

Figure 2.1: System-Level Design

In the next sections we discuss some approaches to system-level
design, namely:

1. Traditional design methodology, 

2. Hardware/software co-design, and 

3. Function/architecture co-design. 

2.1.1 Traditional Design Methodology 

This methodology is not a design methodology per se, but a set of 
design approaches traditionally used in the industry. 

Many organizations, including automotive manufacturers, are
used to designing and developing their systems following some
version of the waterfall model of system development [Wol01].
This means that the design process starts with a specification and,
based on this, several system-level design tasks are performed
manually, usually in an ad-hoc fashion. Then, the hardware and
software parts are developed independently, often by different
teams located far away from each other. Software code is written,
the hardware is synthesized, and the two are expected to
integrate correctly. Simulation and testing are done separately on
hardware and software, with very few integration tests.

While this design approach was appropriate for relatively small
systems produced in a well-defined production chain, it performs
poorly for more complex systems, leading to an increase in the
time-to-market. There are several reasons for this. First of all,
it is very difficult, based on the specification alone, to
accurately determine what system architecture is appropriate and
how the resources should be used. Also, treating the hardware and
software design processes separately, although they depend on
each other, leads to a poorly designed system, which often
performs poorly because the trade-offs between the software and
hardware domains are explored incompletely.

New design methodologies are needed in order to cope with 
the increasing complexity of current systems.

2.1.2 Hardware/Software Co-Design

The main idea behind hardware/software co-design is to
concurrently design (hence the term co-design) and develop both
the hardware and the software components of the system. Surveys
about hardware/software co-design can be found in [Mic96],
[Mic97], [Ern98], [Gaj95], [Sta97], [Wol94], [Wol03].

At the beginning, researchers proposing co-design approaches
made quite restrictive assumptions, and the goals were modest.
These approaches are not really system level; they actually
belong to the “other lower-level design tasks” cloud in
Figure 2.1. For example, several researchers have assumed a
simple input specification in the form of a computer program, and
the main goal was to obtain the highest possible execution
performance within a given cost budget (acceleration). The
architecture considered consisted of a single processor together
with an ASIC used to accelerate parts of the functionality
[Cho95a], [Gup95], [Moo97]. In this context, the main problems
were to divide the functionality between the ASIC and the CPU
(hardware/software partitioning) [Axe96], [Ele97], [Ern93],
[Gup93], [Vah94], to automatically generate drivers and other
components related to communication (communication synthesis)
[Cho92], [Wal94], and to simulate and verify the resulting
system (co-simulation and co-verification) [Val95], [Val96].

However, today such restrictive assumptions are no longer 
valid and the goals are much broader [Bol97], [Dav98], [Dav99], 
[Dic98], [Lak99], [Ver96]: 

• The system specification is now assumed to be inherently 
heterogeneous and complex. Several languages as well as 
several models of computation can be found within a specifi- 
cation. 

• The architectures are varied, ranging from distributed
embedded systems, in the automotive electronics area, to
systems on a chip used in telecommunications. The hardware
architectures are heterogeneous, consisting of not only
programmable processors and ASICs, but also DSPs, FPGAs,
ASIPs, etc.

• The goals include not only acceleration with minimal hard- 
ware cost, but also issues related to the reuse of legacy hard- 
ware and software subsystems, real-time constraints, quality 
of service, fault tolerance and dependability, power consump- 
tion, flexibility, time-to-market, etc. 

2.1.3 Function/Architecture Co-design 

Several researchers [Tab00], [Lav99] have pointed out that most
of the hardware/software co-design approaches do not really
address the design tasks at system level, but rather emphasize
the interaction between the hardware and software entities (the
“other lower-level design tasks” in Figure 2.1).

For this reason, a function/architecture co-design methodology
has been proposed [Kie97], [Lie99], [Tab00], [Lav99], [Bal97],
which also addresses the design process before
hardware/software partitioning. This move towards even higher
abstraction levels has been considered the key to shortening
design time and coping with complexity.

Function/architecture co-design uses a top-down synthesis
approach, where trade-offs are evaluated at a high level of
abstraction (see Figure 2.2, adapted from [Ele02]). The main
characteristic of this methodology is that the top-down synthesis
is combined with a bottom-up evaluation of design alternatives,
without the need to perform a full synthesis of the design.
Accurate evaluations are obtained by accurately modeling the
behavior and architecture (the “Mapped and scheduled model” box
in Figure 2.2), and by developing analysis techniques that are
able to derive estimates and to formally verify properties of a
certain design alternative (the “Estimation” and “Simulation and
verification” boxes). The determined estimates and properties,
together with user-specified constraints, are then used to drive
the synthesis process.

Figure 2.2: Function/Architecture Co-design

Thus, several architectures are evaluated to determine if they 
are suited for the specified system functionality. There are two 
extremes in the degrees of freedom available for choosing an 
architecture. At one end, the architecture is already given, and 
no modifications are possible. At the other end of the spectrum, 
no constraints are imposed on the architecture selection, and the 
synthesis task has to determine, from scratch, the best architec- 
ture for the required functionality. These two situations are, 
however, not common in practice. Usually, a hardware platform 
is available, which can be parameterized (e.g., size of memory, 
speed of the buses, etc.). In this case, the synthesis task is to 
derive the parameters of the architecture such that the function- 
ality of the system is successfully implemented. Once an archi- 
tecture is determined and/or parameterized, the function/ 
architecture co-design continues with the mapping of functional- 
ity onto the instantiated architecture. 

2.1.4 Platform-Based Design 

As the complexity of the systems continues to increase, the 
development time lengthens dramatically, and the manufactur- 
ing costs become prohibitively high. To cope with this complex- 
ity, it is necessary to reuse as much as possible at all levels of the 
design process, and to work at higher and higher abstraction 
levels. 

One of the most important components of any system design 
methodology is the definition of a system platform. Such a
platform consists of a hardware infrastructure together with
software components that will be used for several product
versions, and will be shared with other product lines, in the
hope of reducing costs and the time-to-market.

The authors in [Keu00] have proposed techniques for deriving
such a platform for a given family of applications. Their
approach can be used within any design methodology for
determining a system platform that later on can be parameterized
and instantiated to a desired hardware architecture.

Figure 2.3: Platform-Based Design

Considering a given application or family of applications, the
hardware platform has to be instantiated, deciding on certain
parameters and lower-level details, in order to suit the
application(s) in question. In Figure 2.3 (adapted from [Kie97]), the
search for an architecture instance starts from a certain plat- 
form, and a given application. The application is mapped and 
compiled on an architecture instance, and the performance 
numbers are derived, typically using simulation. If the designer 
is satisfied with the performance of the instantiated architec- 
ture, the loop ends. 
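
The instantiation loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the technique of [Keu00]: the function name, the `evaluate` callback (standing in for mapping, compilation and simulation) and the toy performance model are all assumptions made for the example.

```python
def instantiate_platform(candidates, evaluate, is_satisfactory):
    """Try architecture instances in order; stop at the first acceptable one.

    candidates:      iterable of parameter dicts (hypothetical platform knobs)
    evaluate:        placeholder for "map, compile and simulate"; returns the
                     performance numbers of one architecture instance
    is_satisfactory: the designer's acceptance test on those numbers
    """
    for params in candidates:
        performance = evaluate(params)
        if is_satisfactory(performance):
            return params, performance  # the designer is satisfied: loop ends
    return None  # no instance of this platform was good enough


# Toy model (assumption): the response time shrinks as the bus gets faster.
candidates = [{"bus_kbps": 125}, {"bus_kbps": 250}, {"bus_kbps": 500}]


def evaluate(params):
    return {"response_ms": 10_000 / params["bus_kbps"]}


result = instantiate_platform(
    candidates, evaluate, lambda perf: perf["response_ms"] <= 50
)
print(result)
```

In this toy run the first instance that meets the 50 ms target is the 250 kbps one, so the search stops there without evaluating the faster and presumably more expensive bus.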

The research presented in this book concentrates on the
following system-level design tasks:

1. mapping, 

2. scheduling, and 

3. communication synthesis. 

In addition, we consider these design tasks within an incre- 
mental design process, as outlined in the next section. 



2.2 Incremental Design Process

A characteristic of the majority of research efforts concerning 
the design of embedded systems is that the authors concentrate 
on the design, from scratch, of a new system optimized for a par- 
ticular application. For many application areas, however, such a 
situation is extremely uncommon and only rarely appears in 
design practice. It is much more likely that one has to start from 
an already existing system running a certain application and 
the design problem is to implement new functionality (including 
also upgrades to the existing one) on this system. In such a
context it is very important to make no modifications, or as few
as possible, to the already running application. The main reason
for this is to avoid unnecessarily large design and testing times.
Performing modifications on the (potentially large) existing 
application increases design time and, even more, testing time 
(instead of only testing the newly implemented functionality, the 
old application, or at least a part of it, has also to be retested). 

However, this is not the only aspect to be considered. Such an 
incremental design process, in which a design is periodically
upgraded with new features, goes through several iterations.
Therefore, after new functionality has been introduced, the 
resulting system has to be implemented such that additional 
functionality, later to be mapped, can easily be accommodated.

We illustrate such an incremental design process in
Figure 2.4. The product is implemented as a three-processor
system and its version N-1 consists of the set ψ of two
applications (the processes belonging to these applications are
represented as white and black disks, respectively). At the
current moment, application Γ_current is to be added to the
system, resulting in version N of the product. However, a new
version, N+1, is very likely to follow, and this fact is to be
considered during the implementation of Γ_current¹.

If it is not possible to map and schedule Γ_current without
modifying the already running applications, we have to change
the scheduling and mapping of some applications in ψ. However,
even with serious modifications performed on ψ, it is still
possible that certain constraints are not satisfied. In this case
the hardware architecture has to be changed by, for example,
adding a new processor, and the mapping and scheduling
procedure for Γ_current has to be restarted. In this book we do
not elaborate on the aspect of adding new resources to the
architecture, but will concentrate on the mapping and
scheduling aspects. Therefore, we will consider that a possible
mapping and scheduling of Γ_current which satisfies the imposed
constraints can be found (while minimizing the modification of
the already running applications), and this solution has to be
further improved in order to facilitate the implementation of
future applications.



1. The design process outlined here also applies when Γ_current is a
new version of an application Γ_old ∈ ψ. In this case, all the
processes and communications belonging to Γ_old are eliminated from
the running system ψ before starting the mapping and scheduling of
Γ_current.


Figure 2.4: Incremental Design Process

2.3 Application Modeling

The functionality of the host system, into which the electronic 
system is embedded, is normally described using a formalism 
from that particular domain of application. For example, if the 
host system is a vehicle, then its functionality is described in 
terms of control algorithms using differential equations, which 
are modeling the behavior of the vehicle and its environment. At 
the level of the embedded system which controls the host
system, viewed as the system level for us, the functionality is
typically described as a set of functions, accepting certain
inputs and producing some output values.

There is a lot of research in the area of system modeling and 
specification, and an impressive number of representations have 
been proposed. An overview, classification and comparison of
different design representations and modeling approaches are
given in [Edw97], [Edw00], [Lav99].

The scheduling and mapping design tasks deal with sets of 
interacting processes. A process is a sequence of computations 
(corresponding to several building blocks in a programming lan- 
guage) which starts when all its inputs are available. When it 
finishes executing, the process produces its output values. 
Researchers have used, for example, dataflow process networks 
(also called task graphs, or process graphs) [Lee95] to describe 
interacting processes, and have represented them using directed 
acyclic graphs, where a node is a process and the directed arcs 
are dependencies between processes. 

One drawback of dataflow process graphs is that they are not 
suitable to capture the control aspects of an application. For 
example, it can happen that the execution of some processes can 
also depend on conditions computed by previously executed pro- 
cesses. By explicitly capturing the control flow in the model, a 
more fine-tuned modeling and a tighter (less pessimistic) assign- 
ment of execution times to processes is possible, compared to 
traditional dataflow-based approaches. Several researchers
have proposed extensions to the dataflow process graph model in
order to capture these control dependencies [Ele98a], [Thi99],
[Kla01].

In this book we use the conditional process graph (CPG) 
[Ele98a], [Ele00] as an abstract model for representing the
behavior of the application, as it not only captures both dataflow 
and the flow of control, but is also suitable for handling timing 
aspects. 

2.3.1 Conditional Process Graph 

We model an application Γ as a set of conditional process graphs
G_i ∈ Γ. A conditional process graph is an abstract representation
consisting of a directed, acyclic, polar graph G(V, E_S, E_C). Each
node P_i ∈ V represents one process. E_S and E_C are the sets of
simple and conditional edges, respectively. E_S ∩ E_C = ∅ and
E_S ∪ E_C = E, where E is the set of all edges. An edge e_ij ∈ E
from P_i to P_j indicates that the output of P_i is the input of
P_j.

The graph is polar, which means that there are two nodes, 
called source and sink, that conventionally represent the first 
and last process. These nodes are introduced as dummy pro- 
cesses, with zero execution time and no resources assigned, so 
that all other nodes in the graph are successors of the source and 
predecessors of the sink respectively. 
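
As a minimal sketch of this definition, the graph can be held in a small Python class that keeps the simple and conditional edge sets apart and checks the structural properties named above (disjoint edge sets, acyclic, polar). The class name, field names and the six-node example graph are illustrative assumptions and are not taken from Figure 2.5.

```python
from dataclasses import dataclass, field


@dataclass
class ConditionalProcessGraph:
    """Sketch of G(V, E_S, E_C); all names here are illustrative."""
    nodes: set = field(default_factory=set)                # V
    simple_edges: set = field(default_factory=set)         # E_S: pairs (P_i, P_j)
    conditional_edges: dict = field(default_factory=dict)  # E_C: (P_i, P_j) -> (condition, value)

    def edges(self):
        """E = E_S ∪ E_C."""
        return self.simple_edges | set(self.conditional_edges)

    def edge_sets_disjoint(self):
        """E_S ∩ E_C = ∅."""
        return self.simple_edges.isdisjoint(self.conditional_edges)

    def reachable_from(self, start):
        """Nodes reachable from `start` by following at least one edge."""
        seen, stack = set(), [start]
        while stack:
            current = stack.pop()
            for src, dst in self.edges():
                if src == current and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    def is_acyclic(self):
        """No node can reach itself."""
        return all(n not in self.reachable_from(n) for n in self.nodes)

    def is_polar(self, source, sink):
        """Every other node is a successor of the source and a predecessor of the sink."""
        others = self.nodes - {source, sink}
        return (others <= self.reachable_from(source)
                and all(sink in self.reachable_from(n) for n in others))


# Hypothetical example: P1 computes condition C and selects P2 or P3;
# both alternatives meet again in P4. P0 and P5 are the dummy source and sink.
g = ConditionalProcessGraph(
    nodes={"P0", "P1", "P2", "P3", "P4", "P5"},
    simple_edges={("P0", "P1"), ("P2", "P4"), ("P3", "P4"), ("P4", "P5")},
    conditional_edges={("P1", "P2"): ("C", True), ("P1", "P3"): ("C", False)},
)
print(g.edge_sets_disjoint(), g.is_acyclic(), g.is_polar("P0", "P5"))
```

The reachability search is deliberately naive (it rescans all edges at every step), which keeps the sketch short and is adequate for graphs of the size discussed here.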

A mapped conditional process graph, G(V*, E_S*, E_C*, M), is
generated from a conditional process graph G(V, E_S, E_C) by
inserting additional processes (communication processes) on
certain edges and by mapping each process to a given processing
element. The mapping of processes P_i ∈ V* to processors and buses
is given by a function M: V* → PE, where
PE = {pe_1, pe_2, ..., pe_{N_pe}} is the set of processing
elements. PE = PP ∪ B, where PP is the set of programmable
processors and B is the set of allocated buses. For every process
P_i, M(P_i) is the processing element to which P_i is assigned for
execution.

In the process graph depicted in Figure 2.5, P_0 and P_15 are the
source and sink nodes, respectively. The nodes denoted P_1, P_2,
..., P_14 are “ordinary” processes specified by the designer. They
are assigned to one of the three programmable processors, as
indicated by the shading in the figure. The rest of the nodes are
so-called communication processes and they are represented in
Figure 2.5 as solid circles. They are introduced during the
generation of the system representation for each connection which
links processes mapped to different processors, and model
inter-processor communication. All communications in Figure 2.5
are performed on one bus.
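
The generation of the mapped graph can be sketched as follows. The sketch assumes a single bus, as in Figure 2.5, and treats nodes without a mapping entry as dummy source or sink nodes; the function name, the `m_<sender>_<receiver>` naming scheme and the node names N1 and N2 are assumptions made for the illustration.

```python
def insert_communication(edges, mapping, bus="bus"):
    """Return (edges*, M): one communication process per inter-processor edge.

    edges:   list of (sender, receiver) pairs of the unmapped graph
    mapping: process -> processing element; dummy nodes are simply absent
    bus:     the (single) bus all messages are mapped to (assumption)
    """
    mapped_edges = []
    full_mapping = dict(mapping)
    for src, dst in edges:
        pe_src, pe_dst = mapping.get(src), mapping.get(dst)
        if pe_src is None or pe_dst is None or pe_src == pe_dst:
            # Dummy node involved, or both processes share a processor:
            # the edge is kept as it is.
            mapped_edges.append((src, dst))
        else:
            # Different processors: split the edge with a message node
            # that is mapped to the bus.
            message = f"m_{src}_{dst}"
            full_mapping[message] = bus
            mapped_edges += [(src, message), (message, dst)]
    return mapped_edges, full_mapping


edges = [("P0", "P1"), ("P1", "P2"), ("P2", "P3")]
mapping = {"P1": "N1", "P2": "N1", "P3": "N2"}  # P0 is the unmapped dummy source
mapped_edges, M = insert_communication(edges, mapping)
print(mapped_edges)
print(M)
```

Only the edge from P2 to P3 crosses a processor boundary in this example, so it is the only one that receives a communication process.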

Figure 2.5: A Conditional Process Graph Example

An edge e_ij ∈ E_C is a conditional edge (represented with thick
lines in Figure 2.5) and has an associated condition value. In
Figure 2.5 processes P_1 and P_7 have conditional edges at their
output.

We call a node with conditional edges at its output a disjunction
node (and the corresponding process a disjunction process). A
disjunction process has one associated condition, the value of
which it computes. Alternative paths starting from a disjunction
node, which correspond to complementary values of the condition,
are disjoint and they meet in a so-called conjunction node (with
the corresponding process called conjunction process)¹. In
Figure 2.5, circles representing conjunction and disjunction nodes
are depicted with thick borders. The alternative paths starting
from disjunction node P_1, which computes condition C, meet in
conjunction node P_5. We assume that conditions are independent
and alternatives starting from different processes cannot
depend on the same condition.

1. If no process is specified on an alternative path, it is modeled by a
conditional edge from the disjunction to the corresponding conjunction
node (a communication process may be inserted on this edge at mapping).

Execution Semantics

The conditional process graph has the following execution
semantics:

• A process that is not a conjunction process can be activated
only after all its inputs have arrived.

• A conjunction process can be activated after messages coming
on one of the alternative paths have arrived.

• All processes issue their outputs when they terminate.

• A boolean expression X_{P_i}, called a guard, can be associated
to each node P_i in the graph. It represents the necessary
conditions for the respective process to be activated. X_{P_i} is
not only necessary but also sufficient for process P_i to be
activated during a given system execution. Thus, two nodes P_i
and P_j, where P_j is not a conjunction node, are connected by an
edge e_ij only if X_{P_j} ⇒ X_{P_i} (which means that X_{P_i} is
true whenever X_{P_j} is true). This avoids specifications in
which a process is blocked even if its guard is true, because it
waits for a message from a process which will not be activated.
If P_j is a conjunction node, predecessor nodes P_i can be
situated on alternative paths corresponding to a condition.

• Transmission on conditional edges takes place only if the
associated condition value is true and not, like on simple
edges, for each activation of the input process P_i.

• We consider two possible execution environments for
processes, non-preemptive and preemptive:

— In a non-preemptive environment a process cannot be 
interrupted during its execution. 

— In a preemptive execution environment a higher priority 
process can interrupt the execution of lower priority
processes. Also, under certain circumstances, a lower pri- 
ority process can block a higher priority process (e.g., it is 
in its critical section), and we consider that the blocking 
time is computed using the analysis from [Sha90] for the 
priority ceiling protocol. 
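
The activation rules in the list above can be illustrated with a short Python sketch that, for one assignment of condition values, determines which processes are activated during a system execution. It is a simplified reading of the semantics and not the authors' algorithm: timing and preemption are ignored, a conjunction node is activated as soon as any one of its incoming edges delivers a message, and the six-node graph is a hypothetical example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def activated_processes(edges, conjunctions, conditions):
    """Return the set of processes activated for the given condition values.

    edges:        list of (src, dst, literal); literal is None for a simple
                  edge or (condition_name, required_value) for a conditional one
    conjunctions: set of conjunction nodes
    conditions:   condition_name -> bool, as computed by the disjunction processes
    """
    incoming = {}
    for src, dst, literal in edges:
        incoming.setdefault(src, [])
        incoming.setdefault(dst, []).append((src, literal))

    order = TopologicalSorter(
        {node: [src for src, _ in ins] for node, ins in incoming.items()}
    ).static_order()

    active = set()
    for node in order:
        # An edge delivers a message if its sender ran and, for a
        # conditional edge, the condition has the required value.
        delivered = [
            src in active and (lit is None or conditions[lit[0]] == lit[1])
            for src, lit in incoming[node]
        ]
        if not incoming[node]:
            active.add(node)              # the source is always activated
        elif node in conjunctions:
            if any(delivered):            # one alternative path suffices
                active.add(node)
        elif all(delivered):              # all inputs must have arrived
            active.add(node)
    return active


edges = [
    ("P0", "P1", None),
    ("P1", "P2", ("C", True)),
    ("P1", "P3", ("C", False)),
    ("P2", "P4", None),
    ("P3", "P4", None),
    ("P4", "P5", None),
]
when_true = activated_processes(edges, {"P4"}, {"C": True})
when_false = activated_processes(edges, {"P4"}, {"C": False})
print(sorted(when_true), sorted(when_false))
```

For either value of C exactly one of the alternatives P2 and P3 runs, while the conjunction node P4 and everything after it is activated in both cases.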

The above execution semantics is that of a so-called single-rate
system. It assumes that a node is executed at most once for each
activation of the system. If communicating processes with
different periods have to be handled (in which case we consider
that each process P_i has an associated period T_i), this can be
solved by generating several instances of the processes and
building a CPG which corresponds to a set of processes as they
occur within a time period that is equal to the least common
multiple of the periods of the involved processes (see Figure 5.5).
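
As a small worked illustration of this unrolling, the following Python sketch computes the least common multiple of the periods and lists the instances of each process inside that interval. The periods (20 and 30 time units) are made-up values, and releasing instance k at time k·T_i is an assumption of the sketch.

```python
from math import lcm  # accepts several arguments from Python 3.9


def unroll_instances(periods):
    """Return (hyperperiod, releases) for a dict process -> period.

    releases maps each process to the assumed release times of its
    instances within one least-common-multiple interval.
    """
    hyperperiod = lcm(*periods.values())
    releases = {
        name: [k * period for k in range(hyperperiod // period)]
        for name, period in periods.items()
    }
    return hyperperiod, releases


hyperperiod, releases = unroll_instances({"P1": 20, "P2": 30})
print(hyperperiod, releases)
```

With these values the merged graph covers 60 time units and contains three instances of P1 and two instances of P2.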

Throughout the book we will assume, without loss of generality,
that all processes and messages belonging to a process graph
G_i have the same period T_i = T_{G_i}, which is the period of the
process graph.

Specifying Timing Information

Let N_{P_i} be the set of processing elements to which P_i can
potentially be mapped. For each processing element
pe_j ∈ N_{P_i}, we know the worst-case execution time
C_{P_i}^{pe_j} of process P_i, when executed on pe_j. When the
mapping of a process P_i is clear from the context, we use the
term C_i to denote its worst-case execution time. In addition,
each process P_i is characterized by a period T_i and a priority
priority_{P_i}¹.

The communication processes (messages), modeling
inter-processor communication, have an associated execution time
C_ij (where P_i is the sender and P_j the receiver process) equal
to the corresponding transmission time. For each message m we
know its size s_m. A message is sent once in every n_m
invocations of the sending process, with a period T_m = n_m T_i
inherited from the sender process P_i. In the case of an
event-driven communication protocol (e.g., CAN) messages also
have an associated priority, priority_m.

As mentioned, we consider execution times of processes, as 
well as the communication times, to be given. In Figure 2.5 they 
are depicted to the right of each node. In the case of hard
real-time systems these will typically be worst-case execution
times, and their estimation has been extensively discussed in the
literature [Eng99], [Ern97], [Li95], [Lun99], [Mal97], [Wol02].

If we consider the activation time of the source process as a
reference, the activation time of the sink process is the delay of
the system at a certain execution. This delay has to be, in the
worst case, smaller than a certain imposed deadline D_{G_i} on the
process graph G_i. Throughout the book we assume that the
deadline can be larger than the period.
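
To make the notion of delay concrete, the sketch below computes the activation time of the sink relative to the source for a small graph, under the strong simplifying assumption that every process starts as soon as all its predecessors have finished. Contention for processors and buses is ignored, so the result is only a lower bound on the real delay and not the schedulability analysis developed in this book. All names and numbers are illustrative.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def contention_free_delay(wcet, edges, source, sink):
    """Sink activation time minus source activation time, ignoring contention.

    wcet:  process -> worst-case execution time (dummy nodes have 0)
    edges: list of (predecessor, successor) pairs of an acyclic graph
    """
    predecessors = {node: [] for node in wcet}
    for src, dst in edges:
        predecessors[dst].append(src)

    activation = {}
    for node in TopologicalSorter(predecessors).static_order():
        # A process is activated when its last predecessor terminates.
        activation[node] = max(
            (activation[p] + wcet[p] for p in predecessors[node]), default=0
        )
    return activation[sink] - activation[source]


wcet = {"source": 0, "P1": 3, "P2": 4, "P3": 2, "sink": 0}
edges = [("source", "P1"), ("source", "P2"),
         ("P1", "P3"), ("P2", "P3"), ("P3", "sink")]
delay = contention_free_delay(wcet, edges, "source", "sink")
deadline = 10  # hypothetical deadline on the process graph
print(delay, delay < deadline)
```

Here the longest path runs through P2 and P3, giving a delay of 6 time units, which is smaller than the assumed deadline of 10.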

Deadlines can also be placed locally on processes. Release
times of some processes as well as multiple deadlines can be
easily modeled by inserting dummy nodes between certain processes
and the source or the sink node respectively. These dummy
nodes represent processes with a certain execution time but
which are not allocated to any processing element.

1. In the case of a static cyclic scheduling environment no priority has to
be attached to the process.

2.3.2 Incremental Design: Modeling the Already 
Implemented Applications 

Let us consider an incremental design process as outlined in
Section 2.2 (Figure 2.4). If the initial attempt to schedule and
map Γ_current does not succeed, we have to modify the schedule
and, possibly, the mapping of some already running applications,
belonging to ψ, in the hope of finding a valid solution for
Γ_current.

The goal is to find the minimal modification to the existing
system which leads to a correct implementation of Γ_current. In
our context, such a minimal modification means remapping and/or
rescheduling a subset Ω of the old applications, Ω ⊆ ψ, so that
the total cost of re-implementing Ω is minimized.

Remapping and/or rescheduling a certain application Γ_i ∈ ψ
can trigger the need to also perform modifications to one or
several other applications because of, for example, the
dependencies between processes belonging to these applications.
In order to capture such dependencies between the applications in
ψ as well as their modification costs, we have introduced a
representation called the application graph.

We represent a set of applications as a directed acyclic graph
G(V, E), where each node Γ_i ∈ V represents an application. An
edge e_ij ∈ E from Γ_i to Γ_j indicates that any modification to
Γ_i would trigger the need to also remap and/or reschedule Γ_j,
because of certain interactions between the applications¹. Each
application in the graph has an associated attribute specifying if
that particular application is allowed to be modified or not (in
the latter case, it is called “frozen”). To those nodes Γ_i ∈ V
representing modifiable applications, the designer has associated
a cost R_{Γ_i} of re-implementing Γ_i. Given a subset of
applications Ω ⊆ ψ, the total cost of modifying the applications
in Ω is:

$$R(\Omega) = \sum_{\Gamma_i \in \Omega} R_{\Gamma_i} \qquad (2.1)$$

1. If a set of applications have a circular dependence, such that the
modification of any one implies the remapping of all the others in that
set, the set will be represented as a single node in the graph.

Modifications of an already running application can only be 
performed if the process graphs corresponding to that applica- 
tion, as well as the related deadlines (which have to be satisfied 
also after re-mapping and re-scheduling), are available. How- 
ever, this is not always the case, and in such situations that par- 
ticular application has to be considered frozen. 

Example 2.1: In Figure 2.6 we present the application
graph corresponding to a set of ten applications. Four
applications, among them Γ_10, are depicted in black and are
frozen: no modifications are possible to them. The rest of the
applications have the modification cost R_{Γ_i} depicted on their
left. For example, Γ_7 can be remapped with a cost of 20. If Γ_4
is to be re-implemented, this also requires the modification of
Γ_7, with a total cost of 90. One of the remaining applications,
although not frozen, cannot be remapped or rescheduled, as this
would trigger the need to modify a frozen application.

■

Figure 2.6: Modeling Already Implemented Applications
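
The cost computation of Equation 2.1 on an application graph can be sketched in Python as follows. The sketch follows the dependency edges to collect every application that has to be re-implemented together with a chosen one, and reports that no modification is possible if a frozen application would be affected. Only the figures stated in Example 2.1 are used: the individual cost of 70 for Γ_4 is inferred from the total of 90, assuming Γ_7 is the only application it drags along, and the application Γx depending on the frozen Γ_10 is hypothetical.

```python
def modification_cost(app, triggers, cost, frozen):
    """Total cost R(Ω) of modifying `app` and everything it drags along.

    triggers: application -> applications that must also be modified
              (the edges of the application graph)
    cost:     modifiable application -> re-implementation cost
    frozen:   set of applications that must not be touched
    Returns None if the modification would affect a frozen application.
    """
    omega, stack = set(), [app]
    while stack:
        current = stack.pop()
        if current in frozen:
            return None
        if current not in omega:
            omega.add(current)
            stack.extend(triggers.get(current, ()))
    return sum(cost[a] for a in omega)


# Γ7 = 20 and the total of 90 are from Example 2.1; Γ4 = 70 is inferred.
# "Γx" and its dependency on Γ10 are hypothetical.
triggers = {"Γ4": ["Γ7"], "Γx": ["Γ10"]}
cost = {"Γ4": 70, "Γ7": 20, "Γx": 50}
frozen = {"Γ10"}

print(modification_cost("Γ7", triggers, cost, frozen))
print(modification_cost("Γ4", triggers, cost, frozen))
print(modification_cost("Γx", triggers, cost, frozen))
```

Returning `None` rather than a very large number keeps the frozen case explicit, so a caller that searches for the cheapest subset Ω can simply skip such candidates.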

As mentioned before, to each application Γ_i ∈ V the designer
has associated a cost R_{Γ_i} of re-implementing Γ_i. Such a cost
can typically be expressed in man-hours needed to perform
retesting of Γ_i and other tasks connected to the remapping and
rescheduling of the application.

If an application is remapped or rescheduled, it has to be vali- 
dated again. Such a validation phase is very time consuming. In 
the automotive industry, for example, the time-to-market in the 
case of the powertrain unit is 24 months. Out of these, 5 months, 
representing more than 20%, are dedicated to validation. In the 
case of the telematic unit, the time-to-market is less than one
year, while the validation time is two months [San03]. However,
if an application is not modified during implementation of new 
functionality, only a small part of the validation tasks have to be 
re-performed (e.g., integration testing), thus reducing signifi- 
cantly the time-to-market, at no additional hardware or develop- 
ment cost. 

How to concretely estimate the modification cost related to an
application is beyond the scope of this book. Several approaches
to cost estimation for different phases of the software
life-cycle have been elaborated and are available in the
literature [Deb97], [Rag02]. One of the most influential software
cost models is the Constructive Cost Model (COCOMO) [Boe00].
COCOMO is at the core of tools such as REVIC [REV94] and its 
newer version SoftEST [Sof97], which can produce cost estima- 
tions not only for the total development but also for testing, inte- 
gration, or modification related retesting of embedded software. 
The results of such estimations can be used by the designer as 
the cost metrics assigned to the nodes of an application graph. 

In general, it can be the case that several alternative costs are
associated to a certain application, depending on the particular
modification performed. Thus, for example, we can have a certain
cost if processes are only rescheduled, and another one if
they are also remapped on an alternative node. For different
modification alternatives considered during design space
exploration, the corresponding modification cost has to be
selected. In order to keep the discussion reasonably simple,
throughout the book we will consider the case with one single
modification cost associated to an application. However, the
generalization to several alternative modification costs is
straightforward.



2.3.3 Cruise Controller Example 

A typical safety critical application with hard real-time con- 
straints, to be implemented on a distributed architecture, is a 
vehicle cruise controller (CC). 

The CC described in this specification delivers the following 
functionality: 

• It maintains a constant speed for speeds over 35 km/h and 
under 200 km/h. 

• It offers an interface (buttons) to increase or decrease the ref- 
erence speed. 

• It is able to resume its operation at the previous reference 
speed. 

• The CC operation is suspended when the driver presses the
brake pedal. 

It is assumed that the CC will operate in a distributed environ- 
ment consisting of several interconnected nodes. There are five 
nodes which functionally interact with the CC system: the Anti- 
lock Braking System (ABS), the Transmission Control Module 
(TCM), the Engine Control Module (ECM), the Electronic Throttle
Module (ETM), and the Central Electronic Module (CEM). 

We have considered two hardware architectures for the imple- 
mentation of the cruise controller, presented in Figure 2.7. In 
Figure 2.7a, all nodes are connected to a TTP bus, while in 
Figure 2.7b, we have a two-cluster system. In Figure 2.7b, the
ABS and TCM nodes are part of the time-triggered cluster, which
uses the TTP as the communication protocol, and the ECM and
ETM nodes are on the event-triggered cluster, which uses CAN.
The CEM node is the gateway, connected to both networks.

a) Hardware architecture: five nodes connected by a TTP bus
b) Hardware architecture: a two-cluster system (TT cluster and ET cluster)

Figure 2.7: Two Hardware Architectures for the Cruise Controller

We have modeled the specification of the CC system using a 
conditional process graph that consists of 32 processes, and 
includes two conditions. The first condition, calculated by the 
source node, decides if the CC is in operation or not (ON or OFF), 
while the second condition is used to decide, in the case the CC is
operational, if the car should speed up or brake when trying to
reach the reference speed.

The model is presented in Figure 2.8 without assuming any 
mapping. However, when discussing the scheduling tasks 
addressed in this book, the mapping is considered as already 
given. For those cases, we will use the mapping depicted in 
Figure 2.9. In addition to the nodes representing processes, in 
Figure 2.9 we have also introduced nodes representing the com- 
munication between processes mapped on different processors, 
depicted with black dots (see Section 2.3.1). The worst-case exe- 
cution times, considering the mapping in the figure, are depicted 
to the right of each process. The message sizes are depicted to 
the left of each message. 

The cruise controller example presented in this section will be 
used in the following chapters for evaluating different 
approaches to the discussed problems.

Depending on the particular application, real-time systems
can be implemented as uniprocessor, multiprocessor, or distrib- 
uted systems. These systems can be hard or soft, event-driven or 
time-driven, fault-tolerant, autonomous, etc. A good classifica- 
tion of real-time systems is given in [Kop97a]. 

Our discussion in this book concentrates on safety-critical 
hard real-time applications implemented on distributed plat- 
forms, where communication has an important impact on the 
global functionality. 

This chapter describes the hardware and software architec- 
tures we consider for the implementation of a distributed real- 
time system. The general hardware platform is introduced in 
Section 3.2. We particularize this platform for time-driven sys- 
tems in Section 3.3, present event-driven systems in Section 3.4, 
and show how the two can be combined into multi-cluster systems
in Section 3.5.

3.1 Time-Triggered vs. Event-Triggered

According to [Kop97a] a trigger is “an event that causes the start 
of some action, e.g., the execution of a task or the transmission of
a message.” Two different approaches to the design of real-time 
systems can be identified, based on the triggering mechanisms 
for the processing and communication: 

• Time-Triggered (TT) 

In the time-triggered approach activities are initiated at pre- 
determined points in time. In a distributed time-triggered 
system it is assumed that the clocks of all nodes are synchronized
to provide a global notion of time. Time-triggered systems
are typically implemented using non-preemptive static
cyclic scheduling, where the process activation or message 
communication is done based on a schedule table built off- 
line. 

• Event-Triggered (ET) 

In the event-triggered approach activities happen when a 
significant change of state occurs. Event-triggered systems 
are typically implemented using preemptive priority-based
scheduling, where, as response to an event, the appropriate 
process is invoked to service it. 
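
The difference between the two triggering mechanisms can be sketched in a few lines of Python. This is only an illustration: the contents of the schedule table, the cycle length, and the convention that a larger number means a higher priority are assumptions made for the example, not values from the text.

```python
from dataclasses import dataclass

# Time-triggered: a table built off-line maps activation times within one
# cycle to the process that has to be started at that time.
SCHEDULE_TABLE = {0: "P1", 20: "P2", 50: "P3"}  # hypothetical entries, in ms
CYCLE_LENGTH = 100                               # hypothetical cycle, in ms


def tt_dispatch(now_ms):
    """Return the process to start at time now_ms, or None if there is none."""
    return SCHEDULE_TABLE.get(now_ms % CYCLE_LENGTH)


@dataclass
class ReadyProcess:
    name: str
    priority: int  # assumption: larger value means higher priority


def et_dispatch(ready):
    """Event-triggered: pick the highest-priority ready process, if any."""
    return max(ready, key=lambda p: p.priority, default=None)
```

In the time-triggered case the decision depends only on the clock, while in the event-triggered case it depends on which processes happen to be ready when the scheduler runs.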

There has been a long debate in the real-time and embedded 
systems communities concerning the advantages of each 
approach [Aud93], [Kop97a], [Xu93]. Several aspects have been
considered in favour of one or the other approach, such as flexi- 
bility, predictability, jitter control, processor utilization, and 
testability. 

An interesting comparison of the ET and TT approaches, from a 
more industrial, in particular automotive, perspective, can be 
found in [Lön99]. The conclusion there is that one has to choose
the right approach, depending on the particularities of the application.

The analysis and synthesis techniques presented in this book
are applied to the following three types of systems. The hardware
architectures for these three types of systems are particularizations
of the generic hardware platform presented in the
next section.

1. Time-Driven Systems 

In time-driven systems processes are time-triggered. The details
of the hardware and software architectures for time-driven
systems are presented in Section 3.3. The mapping,
scheduling and communication synthesis methods for time-driven
applications are presented in Part II of this book.

2. Event-Driven Systems 

In this type of system, processes are event-triggered. The
hardware and software architectures for event-driven sys- 
tems are detailed in Section 3.4. Part III presents our analy- 
sis and synthesis methods for event-driven systems. 

3. Multi-Cluster Systems 

The systems presented at points one and two are either TT or 
ET. However, for certain applications, the two approaches 
can be used together, some processes being TT and others ET. 
Moreover, efficient implementation of new, highly sophisti- 
cated automotive applications, entails the use of TT process 
sets together with ET ones implemented on top of complex 
distributed architectures. 

One approach to the design of such systems is to allow ET
and TT processes to share the same processor, and static (TT)
and dynamic (ET) communications to share the same
bus. Such bus sharing of TT and ET messages is possible with
protocols that support both static and dynamic communication
[Fle02]. Traian Pop et al. [Pop02b], [Pop02c] have addressed
the problem of timing analysis and design optimization for
such systems.

In this book, we consider systems designed as interconnected
clusters of processors. Each such cluster can be either
TT or ET. Depending on their particular nature, certain parts
of an application can be mapped on processors belonging to 
an ET cluster or a TT cluster. 

The hardware and software architectures for such multi- 
cluster systems are presented in Section 3.5. The analysis 
and synthesis methods for multi-cluster systems are out- 
lined in Part IV of the book. 



3.2 The Hardware Platform 

We consider, in the most general case, a hardware platform com- 
posed of several networks, interconnected with each other (see 
Figure 3.1). Each network has its own communication protocol, 
and internetwork communication is via a gateway which is a 
node connected to both networks. The architecture can contain 
several such networks, having different types of topologies. 

A network is composed of several different types of hardware
components, called nodes. Every node consists of a communication
controller, a CPU, a RAM, a ROM and an I/O interface to sensors
and actuators. A node can also have an ASIC in order to
accelerate parts of its functionality. The communication controllers
implement the protocol services, and run independently of
the node’s CPU.

Figure 3.1: Distributed Hard Real-Time Systems

The microcontrollers used in a node and the type of network
protocol employed are influenced by the nature of the function- 
ality and the imposed real-time, fault-tolerance and power con- 
straints. In the automotive electronics area the functionality is 
typically divided in two classes, depending on the level of criti- 
calness: 

• Body electronics refers to the functionality that controls sim- 
ple devices such as the lights, the mirrors, the windows, the 
dashboard. The constraints of the body electronic functions 
are determined by the reaction time of the human operator 
that is in the range of 100 ms to 200 ms. A typical body elec- 
tronics system within a vehicle consists of a network of ten to 
twenty nodes that are interconnected by a low bandwidth 
communication network like LIN. A node is usually implemented
using a single-chip 8-bit microcontroller (e.g., Motorola
68HC05 or Motorola 68HC11) with some hundred bytes
of RAM and Kilobytes of ROM, I/O points to connect sensors
and to control actuators, and a simple network interface.
Moreover, the memory size is growing by more than 25%
each year [Kop99], [Chi96].

• System Electronics are concerned with the control of vehicle 
functions that are related to the movement of the vehicle. 
Examples of system electronics applications are engine con- 
trol, braking, suspension, vehicle dynamics control. The tim- 
ing constraints of system electronic functions are in the 
range of a couple of ms to 20 ms, requiring 16-bit or 32-bit
microcontrollers (e.g., Motorola 68332) with about 16 Kilobytes
of RAM and 256 Kilobytes of ROM. These microcontrollers
have built-in communication controllers (e.g., the
68HC11 and 68HC12 automotive family of microcontrollers
have an on-chip CAN controller), I/O to sensors and actuators,
and are interconnected by high bandwidth networks
[Kop99], [Chi96].

In order to provide accurate analysis techniques, we need to 
know the details of the communication protocols used to connect 
the components of the architecture. 

A large number of communication protocols are currently 
available for embedded systems. However, only a few of them 
are suitable for safety-critical applications where predictability 
is mandatory [Rus01]. A survey and comparison of communication
protocols for safety-critical embedded systems is available
in [Rus01].

The duality between event-triggered and time-triggered 
approaches discussed in Section 3.1 is also reflected at the level 
of the communication infrastructure, where communication 
activities can be triggered either dynamically, in response to an 
event, or statically, at predetermined moments in time. 

Therefore, on one hand, there are protocols that schedule the 
messages statically based on the progression of time, for example,
the SAFEbus [Hoy92] and SPIDER [Min00] protocols for the
avionics industry, and the TTCAN [Int02] and Time-Triggered 
Protocol (TTP) [Kop03] intended for the automotive industry. Out 
of these, Rushby concludes that TTP “is unique in being used for 
both automobile applications, where volume manufacture leads 
to very low prices, and aircraft, where a mature tradition of 
design and certification for flight-critical electronics provides 
strong scrutiny of arguments for safety” [Rus01].

On the other hand, there are several communication protocols 
where message scheduling is performed dynamically, such as 
Byteflight [Ber00] introduced by BMW for automotive applications,
Controller Area Network (CAN) [Bos91] used in a large
number of application areas including automotive electronics,
LonWorks [Ech03] and Profibus [Pro03] for real-time systems in
general, etc. Out of these, CAN is the best known and most widespread
event-driven communication protocol in the area of distributed
embedded real-time systems.

However, there is also a hybrid type of communication protocol,
like the FlexRay protocol [Fle02], that allows the sharing of
the bus by event-driven and time-driven messages.

Throughout this book we will use the TTP and CAN as repre- 
sentatives for time-driven and event-driven protocols, respec- 
tively. A detailed comparison of TTP and CAN is provided in 
[Kop01].

3.2.1 The Time-Triggered Protocol 

The time-triggered protocol [Kop03] was designed for distrib- 
uted real-time applications that require predictability and reli- 
ability (e.g., drive-by-wire [XbW98]). It integrates all the 
services necessary for fault-tolerant real-time systems. TTP ser- 
vices of importance to our problems are: message transport with 
acknowledgment and predictable low latency, clock synchronization
within the microsecond range and rapid mode changes.

The communication channel is a broadcast channel, so a message
sent by a node is received by all the other nodes. The bus
access scheme is time-division multiple-access (TDMA)
(Figure 3.2). Each node $N_i$ can transmit only during a predetermined
time interval, the so-called TDMA slot $S_i$. In such a slot, a
node can send several messages packaged in a frame. A
sequence of slots corresponding to all the nodes in the architec- 
ture is called a TDMA round. A node can have only one slot in a 
TDMA round. Several TDMA rounds can be combined together in a 
cycle that is repeated periodically. The sequence and length of 
the slots are the same for all the TDMA rounds. However, the 
length and contents of the frames may differ. 
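
The TDMA access scheme described above can be illustrated with a short Python sketch. The function names and the representation of a round as a list of slot lengths (one entry per node, in transmission order) are assumptions made for this example; they are not part of the TTP specification.

```python
from bisect import bisect_right
from itertools import accumulate


def slot_starts(slot_lengths):
    """Offset of every slot from the start of a TDMA round."""
    return [0] + list(accumulate(slot_lengths))[:-1]


def owner_at(t, slot_lengths):
    """Index of the node whose slot contains time t (t >= 0)."""
    offset = t % sum(slot_lengths)           # position inside the current round
    return bisect_right(slot_starts(slot_lengths), offset) - 1


def next_slot_start(t, node, slot_lengths):
    """Earliest time >= t at which the slot of `node` begins."""
    period = sum(slot_lengths)               # length of one TDMA round
    start = slot_starts(slot_lengths)[node]
    # ceil((t - start) / period) using integer arithmetic only
    rounds = max(0, -((start - t) // period))
    return start + rounds * period
```

Because the sequence and length of the slots are the same in every round, both questions (who owns the bus now, and when may a node transmit next) reduce to arithmetic modulo the round length.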

Every node has a TTP controller that implements the protocol
services, and runs independently of the node’s CPU. Communication
with the CPU is performed through a so-called message base
interface (MBI) which is usually implemented as a dual-ported
RAM (depicted in Figure 3.5).

Figure 3.2: TTP Bus Access Scheme

The TDMA access scheme is imposed by a so called message 
descriptor list (MEDL) that is located in every TTP controller. The 
MEDL basically contains the time when a frame has to be sent or 
received, the address of the frame in the MBI, and the length of 
the frame. The MEDL serves as a schedule table for the TTP controller,
which has to know when to send or receive a frame to or
from the communication channel. 

The TTP controller provides each CPU with a timer interrupt 
based on a local clock, synchronized with the local clocks of the 
other nodes. The clock synchronization is done by comparing the 
a priori known time of arrival of a frame with the observed 
arrival time. By applying a clock synchronization algorithm, TTP
provides a global time-base of known precision, without any 
overhead on the communication. 

There are two types of frames in the TTP. The initialization 
frames, or I-frames, which are needed for the initialization of a 
node, and the normal frames, or N-frames, which are the data 
frames containing, in their data field, the application messages. 
A TTP data frame (Figure 3.3) consists of the following fields:
start of frame bit (SOF), control field, a data field of up to 16 bytes
containing one or more messages, and a cyclic redundancy check
(CRC) field. Frames are delimited by the inter-frame delimiter
(IFD, 3 bits). Note that no identifier bits are necessary, as the TTP
controllers know from their MEDL what frame to expect at a
given point in time.

Figure 3.3: TTP Frame Configuration (control field, 8 bits, including 1 initialization bit and 3 mode change bits; data field, up to 16 bytes; CRC field, 16 bits)

In general, the time-triggered protocol efficiency is in the 
range of 60-80% [Tec02].

Example 3.1: The data efficiency of a frame
that carries 8 bytes of application data, i.e., the percentage of 
transmitted bits which are the actual data bits needed by the 
application, is 69.5% (64 data bits transmitted in a 92-bit 
frame, without considering the details of a particular physi- 
cal layer). 

■ 
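
The arithmetic behind Example 3.1 can be reproduced with the following Python sketch. The field sizes are those given in the frame description; counting the 3-bit inter-frame delimiter as part of the 92 bits is an assumption of this sketch about how the total was obtained, and the function names are hypothetical.

```python
# Field sizes in bits, as given in the description of the TTP data frame.
SOF_BITS = 1       # start of frame bit
CONTROL_BITS = 8   # control field
CRC_BITS = 16      # cyclic redundancy check field
IFD_BITS = 3       # inter-frame delimiter (assumed to be counted in the total)


def ttp_frame_bits(data_bytes):
    """Bits on the bus for a frame carrying data_bytes bytes of application data."""
    if not 0 <= data_bytes <= 16:
        raise ValueError("a TTP data field holds at most 16 bytes")
    return SOF_BITS + CONTROL_BITS + 8 * data_bytes + CRC_BITS + IFD_BITS


def ttp_data_efficiency(data_bytes):
    """Fraction of the transmitted bits that are application data."""
    return 8 * data_bytes / ttp_frame_bits(data_bytes)
```

With these assumptions an 8-byte payload gives a 92-bit frame and a ratio of 64/92, which agrees, up to rounding, with the efficiency quoted in the example.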

3.2.2 The Controller Area Network Protocol 

The controller area network [Bos91] is a priority bus that
employs a collision avoidance mechanism, whereby the node
that transmits the frame with the highest priority wins the contention.
Frame priorities are unique and are encoded in the
frame identifiers, which are the first bits to be transmitted on
the bus.

Figure 3.4: CAN 2.0A Data Frame Configuration (arbitration field, 12 bits: 11 identifier bits and 1 retransmission bit; control field, 6 bits: 4 data length code bits and 2 reserved bits; data field, up to 8 bytes; CRC field, 15 bits; ACK field, 2 bits; EOF field, 7 bits)
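
The contention mechanism can be sketched as follows. In CAN, a dominant bit (0) overrides a recessive bit (1) on the bus, so the frame with the numerically lowest identifier has the highest priority. The Python function below simulates this bit by bit; the function name and the list-based interface are assumptions made for the illustration.

```python
ID_BITS = 11  # identifier length in CAN 2.0A


def arbitrate(identifiers):
    """Return the identifier that wins bitwise arbitration, or None if empty."""
    contenders = set(identifiers)
    if len(contenders) != len(identifiers):
        raise ValueError("frame identifiers must be unique")
    if any(not 0 <= ident < 2 ** ID_BITS for ident in contenders):
        raise ValueError("identifier does not fit into 11 bits")
    if not contenders:
        return None
    for bit in range(ID_BITS - 1, -1, -1):        # most significant bit first
        # The bus carries a dominant 0 if at least one contender sends 0.
        bus = min((ident >> bit) & 1 for ident in contenders)
        # Nodes that sent a recessive 1 but read a dominant 0 back off.
        contenders = {i for i in contenders if ((i >> bit) & 1) == bus}
    (winner,) = contenders                         # unique identifiers: one remains
    return winner
```

Since identifiers are unique, exactly one contender survives all eleven bit positions, and the winning frame is transmitted without being destroyed by the contention.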

In the case of CAN 2.0A, there are four frame types: data 
frame, remote frame, error frame, and overload frame. We are 
mainly interested in the structure of the data frame, depicted in 
Figure 3.4. A data frame contains seven fields: SOF, arbitration 
field that encodes the 11 bit frame identifier, a control field, a 
data field up to 8 bytes, a CRC field, an acknowledgement (ACK) 
field, and an end of frame field (EOF). 

The typical CAN protocol efficiency is in the range of 25-35%
[Tec02]. 

Example 3.2: For a frame that carries 8 bytes of application 
data, using the specification of a data frame presented in 
Figure 3.4, we will have an efficiency of 47.4% [Nol01].
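
One way to arrive at a value of this size is sketched below in Python. The sketch assumes that the frame length includes a CRC delimiter bit, a 3-bit inter-frame space and worst-case bit stuffing (one stuff bit after every four bits of the stuffed part of the frame); these are assumptions of the illustration, since the text only cites the resulting percentage, and the function names are hypothetical.

```python
def can_frame_bits_worst_case(data_bytes):
    """Worst-case length in bits of a CAN 2.0A data frame (assumptions above)."""
    if not 0 <= data_bytes <= 8:
        raise ValueError("a CAN data field holds at most 8 bytes")
    payload = 8 * data_bytes
    # SOF 1 + arbitration 12 + control 6 + CRC 15 + CRC delimiter 1
    # + ACK 2 + EOF 7 + inter-frame space 3
    overhead = 47
    # 34 of the overhead bits plus the payload are subject to bit stuffing.
    stuff_bits = (34 + payload - 1) // 4
    return overhead + payload + stuff_bits


def can_data_efficiency(data_bytes):
    """Fraction of the transmitted bits that are application data."""
    return 8 * data_bytes / can_frame_bits_worst_case(data_bytes)
```

Under these assumptions an 8-byte frame occupies 135 bits, so the ratio is 64/135, close to the efficiency stated in Example 3.2.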



3.3 Time-Driven Systems 

The first type of systems considered in the book are time-driven 
systems, in which processes are activated according to a time-triggered
policy. Typically, in a time-driven system, messages
are transmitted using a time-driven communication protocol
such as the TTP, while the scheduling of processes is performed 
using static cyclic scheduling. 

The hardware architecture consists of one single network, 
composed of a set of nodes interconnected using the TTP. The 
main component of the software architecture is a real-time ker- 
nel that runs on top of each node. 

The kernel running as part of the software architecture on 
each node has a schedule table. This schedule table contains all 
the information needed to take decisions on activation of pro- 
cesses and transmission of messages, on that particular node. 

In order to run a predictable hard real-time application, the 
overhead of the kernel and the worst case administrative overhead
(WCAO) of every system call have to be determined. Having a
time-triggered system, all the activity is derived from the pro- 
gression of time which means that there are no other interrupts 
except for the timer interrupt. 

Several activities, like polling of the I/O or diagnostics, take 
place directly in the timer interrupt routine. The overhead due
to this routine is expressed as the utilization factor $U_t$. $U_t$ represents
a fraction of the CPU power utilized by the timer interrupt
routine, and has an influence on the execution times of the processes.

We also have to take into account the overheads for process
activation and message passing. For process activation we consider
an overhead $\delta_{PA}$. The message passing mechanism is illustrated
in Figure 3.5, where we have three processes, $P_1$, $P_2$ and
$P_3$. $P_1$ and $P_2$ are mapped to node $N_1$ that transmits in slot $S_1$,
and $P_3$ is mapped to node $N_2$ that transmits in slot $S_2$. Message
$m_1$ is transmitted between $P_1$ and $P_2$, which are on the same
node, while message $m_2$ is transmitted from $P_1$ to $P_3$ between the
two nodes. We consider that each process has its own memory
locations for the messages it sends or receives and that the
addresses of the memory locations are known to the kernel
through the schedule table.

$P_1$ is activated according to the schedule table, and when it
finishes it calls the send kernel function in order to send $m_1$, and
then $m_2$. Based on the schedule table, the kernel copies $m_1$ from
the corresponding memory location of $P_1$ to the memory location
of $P_2$. The time needed for this operation represents the WCAO
for sending a message between processes located on the same
node. When $P_2$ is activated, it will find the message in the
right location. According to our scheduling policy, whenever a
receiving process needs a message, the message is already
placed in the corresponding memory location. Thus, there is no
overhead on the receiving side, for messages exchanged on the
same node.

Message $m_2$ has to be sent from node $N_1$ to node $N_2$. At a certain
time, known from the schedule table, the kernel transfers
$m_2$ to the TTP controller by packing $m_2$ into a frame in the MBI.
The WCAO of this function is $\delta_{KS}$. Later on, the TTP controller
knows from its MEDL when it has to take the frame from the MBI,
in order to broadcast it on the bus. In our example, the timing
information in the schedule table of the kernel and the MEDL is
determined in such a way that the broadcasting of the frame is
done in the slot $S_1$ of Round 2. The TTP controller of node $N_2$
knows from its MEDL that it has to read a frame from slot $S_1$ of
Round 2 and to transfer it into the MBI. The kernel in node $N_2$
will read the message $m_2$ from the MBI, with a corresponding
WCAO of $\delta_{KR}$. When $P_3$ is activated based on the local schedule
table of node $N_2$, it will already have $m_2$ in its memory location.
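
The timeline of an inter-node message can be approximated with a small Python sketch. In the approach described here the send and receive times are fixed off-line in the schedule table and the MEDL; the function below only illustrates how such times relate to each other. Its name and parameters are hypothetical, and it assumes that the frame is available to the receiver at the end of the sender's slot.

```python
def delivery_time(ready, delta_ks, slot_start, slot_len, round_len, delta_kr):
    """Time at which the receiving kernel has read the message from its MBI.

    ready      -- time at which the sending kernel starts packing the message
    delta_ks   -- WCAO of packing the message into a frame in the MBI
    slot_start -- offset of the sender's slot within a TDMA round
    slot_len   -- length of the sender's slot
    round_len  -- length of a TDMA round
    delta_kr   -- WCAO of reading the message from the MBI at the receiver
    """
    in_mbi = ready + delta_ks
    # Number of full rounds to wait until the sender's slot starts at or
    # after in_mbi: ceil((in_mbi - slot_start) / round_len), never negative.
    rounds = max(0, -((slot_start - in_mbi) // round_len))
    tx_start = slot_start + rounds * round_len
    return tx_start + slot_len + delta_kr
```

For instance, with a round of 10 time units and a sender slot occupying the first 4 of them, a message that is packed at time 4 has just missed the slot and is sent in the next round.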



1. The overheads $\delta_{KS}$ and $\delta_{KR}$ depend on the length of the transferred
message; in order to simplify the presentation this aspect is not
discussed further.

3.4 Event-Driven Systems

The second type of systems considered in the book are the event- 
driven systems in which processes are managed according to an 
event-driven policy. 

The hardware architecture consists of one single network, 
composed of a set of nodes interconnected using a communica- 
tion channel. 

The scheduling of processes is performed using fixed-priority 
preemptive scheduling. A natural mapping of event-driven mes- 
sages would be on a bus implementing an event-triggered proto- 
col at the data-link layer, such as the CAN bus. Such a solution 
has been considered in the literature, for example in [Tin95].

However, combining preemptive priority-based scheduling at
the process level with time-triggered static scheduling at the
communication level can be the right solution under several circumstances
[Lön99]. Moreover, a communication protocol like
the time-triggered protocol provides a global time base, and 
improves fault-tolerance and predictability. 

Therefore, in Part III of this book we will consider that mes- 
sages produced by event-triggered processes are transmitted 
using the time-triggered communication protocol, and we have 
developed four message passing policies for transmitting event- 
triggered messages over a time-triggered bus (see Section 6.4). 
However, for the event-triggered clusters of a multi-cluster sys- 
tem addressed in Part IV, we will consider that the communica- 
tions are performed using an event-triggered protocol such as 
the CAN protocol. 

As the main component of the software architecture, we have 
a real-time kernel running on the CPU of each node, which has a 
scheduler as one of its main components. This scheduler decides 
on activation of processes, based on their priorities. 

As in the previous section, the overhead of the kernel and the 
worst case administrative overhead (WCAO) of every system call 
have to be determined. Our schedulability analysis takes into
account these overheads, and also the overheads due to
message passing.

The message passing mechanism is illustrated in Figure 3.6,
where we have three processes, $P_1$, $P_2$, and $P_3$. As in the example
illustrated in Figure 3.5, $P_1$ and $P_2$ are mapped to node $N_1$ that
transmits in slot $S_1$, and $P_3$ is mapped to node $N_2$ that transmits
in slot $S_2$. Message $m_1$ is transmitted between $P_1$ and $P_2$ that are
on the same node, while message $m_2$ is transmitted from $P_1$ to $P_3$
between the two nodes.

Messages between processes located on the same processor 
are passed through shared protected objects. The overhead for 
their communication is accounted for by the blocking factor, 
using the analysis from [Sha90] for the priority ceiling protocol.

Message $m_2$ has to be sent from node $N_1$ to node $N_2$. Hence,
after $m_2$ is produced by $P_1$, it will be placed into an outgoing message
queue, called Out. The access to the queue is guarded by a
priority-ceiling semaphore. A so-called transfer process (denoted
with T in Figure 3.6) moves the message from the Out queue
into the MBI.

How the message queue is organized and how the message
transfer process selects the particular messages and assembles
them into a frame depend on the particular approach chosen for
message scheduling (see Section 6.4). The message transfer process
is activated at certain a priori known moments by the
scheduler, in order to perform the message transfer. These activation
times are stored in a message handling time table (MHTT)
available to the real-time kernel in each node. Both the MEDL
and the MHTT are generated off-line as a result of the schedulability
analysis and optimization which will be discussed later. The
MEDL imposes the times when the TTP controller of a certain
node has to move frames from the MBI to the communication
channel. The MHTT contains the times when messages have to be
transferred by the message transfer process from the Out queue
into the MBI, in order to be broadcast by the TTP controller. As a
result of this synchronization, the activation times in the MHTT
are directly related to those in the MEDL, and the MHTT
results directly from the MEDL.

It is easy to observe that we have the most favorable situation 
when, at a certain activation, the message transfer process finds 
in the Out queue all the “expected” messages which then can be
packed into the next frame to be sent by the TTP controller.
However, application processes are not statically scheduled
and availability of messages in the Out queue cannot be
guaranteed at fixed times. Worst case situations have to be con- 
sidered, as will be shown in Section 6.4. 
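
The interplay between the Out queue and the message transfer process can be illustrated with the Python sketch below. It shows only one possible organization: a priority-ordered queue from which whole messages are packed into a frame of fixed capacity. The class and method names, the convention that a smaller number means a more urgent message, and the decision not to split messages into packets are assumptions of this sketch; the actual policies are those of Section 6.4.

```python
import heapq


class OutQueue:
    """Outgoing message queue of a node (illustrative, priority-ordered)."""

    def __init__(self):
        self._heap = []
        self._seq = 0   # tie-breaker that keeps insertion order

    def put(self, priority, name, size):
        """Called by an application process when it produces a message."""
        heapq.heappush(self._heap, (priority, self._seq, name, size))
        self._seq += 1

    def transfer(self, frame_capacity):
        """Called at an activation time stored in the MHTT.

        Moves the most urgent messages that fit into one frame towards the
        MBI and returns their names in packing order.
        """
        frame, used = [], 0
        while self._heap and used + self._heap[0][3] <= frame_capacity:
            _, _, name, size = heapq.heappop(self._heap)
            frame.append(name)
            used += size
        return frame
```

A message that has not yet been produced when `transfer` is called simply misses that frame and waits for a later activation, which is why worst-case situations have to be analyzed.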

Let us come back to Figure 3.6. There we assumed a context in
which the broadcasting of the frame containing message $m_2$ is
done in the slot $S_1$ of Round 2. The TTP controller of node $N_2$
knows from its MEDL that it has to read a frame from slot $S_1$ of
Round 2 and to transfer it into its MBI. In order to synchronize
with the TTP controller and to read the frame from the MBI, the
scheduler on node $N_2$ will activate, based on its local MHTT, a so-called
delivery process, denoted with D in Figure 3.6. The delivery
process takes the frame from the MBI and extracts the messages
from it. For the case when a message is split into several
packets, sent over several TDMA rounds, we consider that a message
has arrived at the destination node after all its corresponding
packets have arrived. When $m_2$ has arrived, the delivery
process copies it to process $P_3$, which will then be activated. Activation
times for the delivery process are fixed in the MHTT just as
explained earlier for the message transfer process.

The number of activations of the message transfer and deliv- 
ery processes depends on the number of frames transferred, and 
they are taken into account for timing analysis, as well as the 
delay implied by the propagation on the communication bus.

3.5 Multi-Cluster Systems

A multi-cluster system consists of several clusters, intercon- 
nected by gateways (Figure 3.7 depicts a two-cluster example). A 
cluster is composed of nodes which share a broadcast communi- 
cation channel. 

In a time-triggered cluster (TTC), processes and messages are 
scheduled according to a static cyclic policy, with the bus implementing
the TTP. On an event-triggered cluster (ETC), the processes
are scheduled according to a priority-based preemptive
approach, while messages are transmitted using the priority- 
based CAN protocol. 

The critical element of such an architecture is the gateway, 
which is a node connected to both types of clusters (hence having 
two communication controllers, for TTP and CAN), and is respon- 
sible for the inter-cluster routing of real-time traffic. 

Although in this book we consider only a two-cluster system,
as the one in Figure 3.7, the approaches presented can be easily
extended to cluster configurations where there are several ETCs
and TTCs interconnected by gateways.

Figure 3.7: A Two-Cluster System Example

A real-time kernel is responsible for activation of processes 
and transmission of messages on each node. On a TTC, the pro- 
cesses are activated based on the local schedule tables, and mes- 
sages are transmitted according to the MEDL. On an ETC, we 
have a scheduler that decides on activation of ready processes 
and transmission of messages, based on their priorities. 

In Figure 3.8 we illustrate our message passing mechanism. 
Here we concentrate on the communication between processes 
located on different clusters. For message passing details within 
a TTC the reader is directed to Section 3.3, while the infrastruc- 
ture needed for communications on an ETC has been detailed in 
[Tin95]. 

Let us consider the example in Figure 3.8, where we have an
application consisting of four processes, mapped on two clusters.
Processes $P_1$ and $P_4$ are mapped on node $N_1$ of the TTC, while $P_2$
and $P_3$ are mapped on node $N_2$ of the ETC. Process $P_1$ sends messages
$m_1$ and $m_2$ to processes $P_2$ and $P_3$, respectively, while $P_2$
and $P_3$ send messages $m_3$ and $m_4$ to $P_4$.

The transmission of messages from the TTC to the ETC takes
place in the following way (see Figure 3.8). $P_1$, which is statically
scheduled, is activated according to the schedule table, and
when it finishes it calls the send kernel function in order to send
$m_1$ and $m_2$, indicated in the figure by number (1). Messages $m_1$
and $m_2$ have to be sent from node $N_1$ to node $N_2$. At a certain
time, known from the schedule table, the kernel transfers $m_1$
and $m_2$ to the TTP controller by packing them into a frame in the
MBI. Later on, the TTP controller knows from its MEDL when it
has to take the frame from the MBI, in order to broadcast it on
the bus. In our example, the timing information in the schedule
table of the kernel and the MEDL is determined in such a way
that the broadcasting of the frame is done in the slot $S_1$ of
Round 2 (2). The TTP controller of the gateway node $N_G$ knows
from its MEDL that it has to read a frame from slot $S_1$ of Round 2
and to transfer it into its MBI (3). Invoked periodically, having
the highest priority on node $N_G$, and with a period which guarantees
that no messages are lost, the gateway process T copies
messages $m_1$ and $m_2$ from the MBI to the TTP-to-CAN priority-ordered
message queue $Out_{CAN}$ (4). The highest priority message
in the queue, in our case $m_1$, will tentatively be broadcast
on the CAN bus (5). Whenever message $m_1$ is the highest
priority message on the CAN bus, it will successfully be broadcast
and will be received by the interested nodes, in our case node $N_2$
(6). The CAN communication controller of node $N_2$ receiving $m_1$
will copy it in the transfer buffer between the controller and the
CPU, and raise an interrupt which will activate a delivery process
D, responsible for activating the corresponding receiving process,
in our case $P_2$, and for handing over message $m_1$, which finally
arrives at the destination (7).

b) Message Passing Example

Figure 3.8: A Message Passing Example for Multi-Cluster Systems

Message m3 (depicted in Figure 3.8 as a hashed rectangle)
sent by process P2 from the ETC will be transmitted to process P4
on the TTC. The transmission starts when P2 calls its send
function and enqueues m3 in the priority-ordered Out_N2 queue (8).
When m3 has the highest priority on the bus, it will be removed
from the queue (9) and broadcast on the CAN bus (10), arriving at
the gateway's CAN controller. The gateway transfer process T is
activated, and m3 is placed in the Out_TTP FIFO queue (11). The
gateway node N_G is only able to broadcast on the TTC in its
corresponding slot S_G of the TDMA rounds circulating on the TTP
bus. According to the MEDL of the gateway, a set of messages not
exceeding the size size_SG of the slot S_G will be removed from the
front of the Out_TTP queue in every round, and packed in the S_G
slot (12). Once the frame is broadcast (13) it will arrive at node
N1 (14), where all the messages in the frame will be copied into
the input buffers of the destination processes (15). Process P4 is
activated according to the schedule table, which has to be
constructed such that it accounts for the worst-case communication
delay of messages m3 and m4, bounded by the analysis in
Section 8.2.1, and, thus, when P4 starts executing it will find m3
and m4 in its input buffer.
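
The two gateway queues described above can be illustrated with a
small sketch. It is not the kernel implementation: the class name
Gateway, its method names, and the convention that a smaller number
means a higher priority are assumptions made for illustration only.
The sketch shows a priority-ordered queue towards the CAN bus and a
FIFO queue towards the TTP bus from which, in every round, messages
are taken from the front as long as they fit into the gateway slot.

```python
import heapq
from collections import deque


class Gateway:
    """Toy model of the two gateway queues (all names are illustrative)."""

    def __init__(self, slot_size_bits):
        self.slot_size_bits = slot_size_bits  # capacity of the gateway slot S_G
        self.out_can = []                     # priority-ordered queue (TTP to CAN)
        self.out_ttp = deque()                # FIFO queue (CAN to TTP)

    def to_can(self, priority, name):
        # Assumption: a lower number means a higher priority.
        heapq.heappush(self.out_can, (priority, name))

    def next_for_can(self):
        # The highest priority message is the one offered to the CAN bus.
        return heapq.heappop(self.out_can)[1] if self.out_can else None

    def to_ttp(self, name, size_bits):
        self.out_ttp.append((name, size_bits))

    def pack_slot(self):
        """Remove messages from the front of the FIFO while they fit in S_G."""
        frame, used = [], 0
        while self.out_ttp and used + self.out_ttp[0][1] <= self.slot_size_bits:
            name, size = self.out_ttp.popleft()
            frame.append(name)
            used += size
        return frame


gw = Gateway(slot_size_bits=16)
gw.to_can(2, "m2")
gw.to_can(1, "m1")
print(gw.next_for_can())      # m1 is offered to the CAN bus first

gw.to_ttp("m3", 8)
gw.to_ttp("m4", 8)
gw.to_ttp("m5", 8)
print(gw.pack_slot())         # messages packed in the first round
print(gw.pack_slot())         # messages left for the next round
```

In this sketch a message larger than the slot would block the FIFO
queue; a real configuration has to size the gateway slot so that
this cannot happen.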

This chapter has presented the architectures of the considered
distributed systems. In part two of the book, consisting of
Chapters 4 and 5, we address time-driven systems; in the third
part, Chapters 6 and 7, event-driven systems are considered; and
in the fourth part, Chapters 8, 9 and 10, we discuss issues related
to multi-cluster systems.

PART II

Time-Driven Systems

Chapter 4
Scheduling and Bus Access Optimization for Time-Driven Systems



In this and in the following chapter we consider time-driven 
distributed real-time systems that use the time-triggered proto- 
col for their communication infrastructure, as described in 
Section 3.3. In this case, both the activation of processes and the 
transmission of messages are done based on the progression of 
time. The applications are modeled as a set of conditional pro- 
cess graphs, as presented in Section 2.3.1. 

Our goal in this chapter is to generate a static schedule and to 
optimize the parameters of the communication protocol, such 
that the worst-case delay by which the system completes execu- 
tion is minimized. 

The chapter starts by presenting an approach to static
scheduling with control and data dependencies for distributed
real-time systems [Dob98], [Ele98a], [Ele00]. The approach considers
a simplified communication model in which the execution time
of the communication processes depends only on the amount of
data exchanged by the processes engaged in the communication.
The communication processes are treated exactly as ordinary
processes during scheduling, and the bus is modeled similarly to a
programmable processor that can “execute” one communication
at a time as soon as the communication becomes “ready”.

We propose in this chapter several extensions to this basic 
approach: 

• scheduling of messages using a realistic communication 
model based on the time-triggered protocol (Section 4.3.1); 

• a new priority function for list scheduling that uses knowl- 
edge about the bus access scheme in order to improve the 
schedule quality (Section 4.3.2); 

• optimization strategies for the synthesis of parameters of the 
communication protocol, aimed at improving the schedule 
quality (Section 4.4). 



4.1 Background 

Static cyclic scheduling of a set of data-dependent software
processes on a multiprocessor architecture has been intensively
researched [Kop97a], [Xu00].

Several approaches are based on list scheduling heuristics 
using different priority criteria [Cof72], [Deo98], [Jor97], 
[Kwo96], [Wu90] or on branch-and-bound algorithms [Kas84]. 
These approaches are based on the assumption that a number of
identical processors are available to which processes are
progressively assigned as the static schedule is elaborated. Such an
assumption is not acceptable for distributed embedded systems,
which are heterogeneous by nature. In [Jor97] a list scheduling
based approach is extended to handle heterogeneous
architectures. Scheduling is performed by progressively assigning
tasks to the allocated processors with the goal of minimizing
the length of the schedule. The proposed algorithm handles only
processors which execute one single process at a time (not
typical for hardware) and the resulting partitioning does not take
into consideration any design constraints.

In [Ben96], [Pra92] static scheduling and partitioning of
processes, and allocation of system components, are formulated as a
mixed integer linear programming (MILP) problem. A disadvan- 
tage of this approach is the complexity of solving the MILP model. 
The size of such a model grows quickly with the number of pro- 
cesses and allocated resources. In [Kuc97] a formulation using 
constraint logic programming has been proposed for similar 
problems. 

In all the previous approaches process interaction is only in
terms of dataflow. However, when including control dependencies,
significant improvements in the quality of the resulting
schedules can be obtained [Ele98a], [Ele00], [Kuc01]. Section 4.2
presents in more detail related research on the static scheduling
for systems with control and data dependencies that is used as a
starting point for our work.

It has been claimed [Xu93] that static cyclic scheduling is the
only approach that can provide solutions to applications that
exhibit data dependencies. However, advances in the area of
fixed priority preemptive scheduling show that such applications
can also be handled with other scheduling strategies
[Aud93], [Tin94b], [Pal98], [Pal99].

Currently, more and more real-time systems are used in phys- 
ically distributed environments and have to be implemented on 
distributed architectures in order to meet reliability, functional, 
and performance constraints. However, researchers have often 
ignored or very much simplified aspects concerning the commu- 
nication infrastructure. 

One typical approach is to consider communication processes
as processes with a given execution time (depending on the
amount of information exchanged) and to schedule them as any
other process, without considering issues like communication
protocol, bus arbitration, packing of messages, clock
synchronization, etc. These aspects are, however, essential in the
context of safety-critical distributed real-time applications and
one of our objectives is to develop a strategy which takes them
into consideration for process scheduling.

Many efforts dedicated to communication synthesis have con- 
centrated on the synthesis support for the communication infra- 
structure but without considering hard real-time constraints 
and system level scheduling aspects [Cho95b], [Dav95], [Knu99], 
[Nar94]. Lower level communication synthesis aspects under
timing constraints have been addressed in [Ort98], [Knu99].

4.2 Scheduling with Control and Data Dependencies

The problem discussed in this section can be formulated
as follows: given an application distributed on a time-driven
system (Section 3.3), modeled as a set of mapped conditional
process graphs (Section 2.3.1), we are interested in generating a
static schedule such that the worst-case delay by which the
system completes execution is minimized.

According to our application model, some processes can only be
activated if certain conditions, computed by previously executed
processes, are fulfilled. Hence, process scheduling is complicated
since at a given activation of the system, only a certain subset of
the total number of processes is executed and this subset differs
from one activation to the other.

As the values of the conditions are unpredictable, the decision
on which process to activate and at which time has to be taken
without knowing which values the conditions will later get. On
the other hand, at a certain moment during execution, when the
values of some conditions are already known, they have to be
used in order to take the best possible decisions on when and
which process to activate. Heuristic algorithms have to be
developed to produce a schedule of the processes such that the
worst-case delay is as small as possible. One such algorithm will be
presented in Section 4.2.1.

The output produced by the scheduling algorithm is a schedule
table that contains all the information needed by a distributed
run-time scheduler to take decisions on the activation of
processes. It is considered that, during execution, a very simple
non-preemptive scheduler located in each processing element
decides on process and communication activation depending on
the actual values of conditions. Only one part of the table has to
be stored in each processor, namely, the part concerning
decisions which are taken by the corresponding scheduler.

Example 4.1: Under these assumptions, Table 4.1 presents
a possible schedule (produced by the algorithm in Figure 4.1)
for the conditional process graph in Figure 2.5. In
Table 4.1 there is one row for each “ordinary” or communication
process, which contains activation times corresponding
to different values of conditions. Each column in the table is
headed by a logical expression constructed as a conjunction
of condition values. Activation times in a given column
represent starting times of the processes when the respective
expression is true.

According to the schedule in Table 4.1, one process is
activated unconditionally at time 0, as given in the first column
of the table. The activation of the rest of the processes, in a
certain execution cycle, depends on the values of the conditions,
which are unpredictable. For example, one of the processes has
to be activated at t = 44 under one combination of the values of
conditions C and D, and at t = 52 under another.

■ 

At a certain moment during the execution, when the values of 
some conditions are already known, they have to be used in 
order to take the best possible decisions on when and which pro- 
cess to activate. Therefore, after the termination of a process 
that produces a condition (disjunction process), the value of the 
condition is broadcast from the corresponding processor to all
other processors. This broadcast is scheduled as soon as possible
on the communication channel, and is considered together with 
the scheduling of the messages. 

To produce a deterministic behavior, which is correct for any 
combination of conditions, the table has to fulfill several require- 
ments: 

1. No process will be activated if, for a given execution, the con- 
ditions required for its activation are not fulfilled. 

2. Activation times have to be uniquely determined by the con- 
ditions. 

3. Activation of a process P_i at a certain time t has to depend
only on condition values which are determined at the
respective moment t and are known to the processing element
which executes P_i.
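
As an illustration of how such a table can be used and checked, the
following sketch stores, for each process, a list of columns with
their activation times. The data layout, the function names, and
the process names and times in the sample table are assumptions
made for this example; they are not taken from Table 4.1. A column
is a conjunction of condition values, and a process may only be
activated by a column whose conditions are all known and match
(compare requirements 1 and 3). The check covers requirement 2 by
enumerating all combinations of condition values.

```python
from itertools import product


def column_holds(column, known):
    # A column is a dict such as {"C": True, "D": False}; it holds only if
    # every condition it mentions is already known and has the required value.
    return all(known.get(cond) == value for cond, value in column.items())


def activation_time(table, process, known):
    """Activation time of `process` for the known condition values, or None."""
    times = {t for column, t in table.get(process, []) if column_holds(column, known)}
    if len(times) > 1:
        raise ValueError(f"ambiguous activation time for {process}")
    return times.pop() if times else None


def is_deterministic(table, conditions):
    """Requirement 2: at most one activation time for any condition values."""
    for values in product([False, True], repeat=len(conditions)):
        known = dict(zip(conditions, values))
        for process in table:
            try:
                activation_time(table, process, known)
            except ValueError:
                return False
    return True


# Hypothetical table: Pa is unconditional, Pb depends on condition C.
table = {
    "Pa": [({}, 0)],
    "Pb": [({"C": True}, 10), ({"C": False}, 14)],
}
print(activation_time(table, "Pb", {"C": True}))
print(activation_time(table, "Pb", {}))          # C not yet known
print(is_deterministic(table, ["C"]))
```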

4.2.1 List Scheduling based Algorithm 

Optimal scheduling has been proven to be an NP-complete
problem [Ull75] in even simpler contexts than those characteristic of
distributed systems represented as CPGs. Hence, it is essential
to develop heuristics which produce good quality results in a
reasonable time.

In [Dob98], [Ele98a], [Ele00] the authors concentrate on
developing a scheduling algorithm for systems with both control
and data dependencies, modeled using the conditional process
graph. As the starting point for our improved scheduling
technique that is tailored for time-triggered embedded systems we
consider the list scheduling based algorithm in [Dob98], [Ele00]
presented, in a simplified form, in Figure 4.1.

List scheduling heuristics [Ele98b], [Ele00] are based on
priority lists from which processes are extracted in order to be
scheduled at certain moments. In the algorithm presented in
Figure 4.1, there is such a list, ReadyList, which contains the
processes eligible to be activated on the corresponding processor at
time CurrentTime. These are processes which have not yet been
scheduled but have all predecessors already scheduled and
terminated.

The ListScheduling function is recursive and calls itself for each
disjunction node in order to separately schedule the nodes in the
true branch, and those in the false branch, respectively (lines 10
and 13 in Figure 4.1). Thus, the alternative paths are not
activated simultaneously and resource sharing is correctly achieved
(for details on how the algorithm fulfills the three requirements
on the schedule table, identified earlier, we refer to [Ele00]).

An essential component of a list scheduling heuristic is the
priority function used to solve conflicts between ready processes.
As can be observed in Figure 4.1, the highest priority process
will be extracted by function GetReadyProcess from the ReadyList
in order to be scheduled (line 5).

ListScheduling(CurrentTime, ReadyList, KnownConditions)
 1  repeat
 2    Update(ReadyList)
 3    for each processing element PE do
 4      if PE is free at CurrentTime then
 5        P_i = GetReadyProcess(ReadyList)
 6        if there exists a P_i then
 7          Insert(P_i, ScheduleTable, CurrentTime, KnownConditions)
 8          if P_i is a disjunction process then
 9            C_i = condition calculated by P_i
10            ListScheduling(CurrentTime,
11              ReadyList ∪ ready nodes from the true branch,
12              KnownConditions ∪ true C_i)
13            ListScheduling(CurrentTime,
14              ReadyList ∪ ready nodes from the false branch,
15              KnownConditions ∪ false C_i)
16          end if
17        end if
18      end if
19    end for
20    CurrentTime = earliest time when a scheduled process terminates
21  until all processes of this alternative path are scheduled
end ListScheduling

Figure 4.1: List Scheduling Based Algorithm for CPGs
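
To make the main loop of Figure 4.1 more concrete, the sketch below
implements a deliberately simplified variant: it handles only data
dependencies (no conditions, hence no recursion on disjunction
processes), does not model messages or the bus, and assumes strictly
positive execution times. The function name list_schedule, its
arguments, and the sample graph are assumptions made for this
illustration.

```python
def list_schedule(wcet, mapping, preds, priority):
    """Non-preemptive list scheduling of an acyclic, already mapped process graph.

    wcet:     process -> worst-case execution time (assumed > 0)
    mapping:  process -> processing element
    preds:    process -> list of predecessor processes
    priority: process -> number (a larger value is scheduled first)
    Returns   process -> (start time, finish time).
    """
    schedule = {}
    pe_free_at = {pe: 0 for pe in set(mapping.values())}
    current_time = 0
    while len(schedule) < len(wcet):
        # Ready list: unscheduled processes whose predecessors have terminated.
        ready = [p for p in wcet
                 if p not in schedule
                 and all(q in schedule and schedule[q][1] <= current_time
                         for q in preds.get(p, []))]
        for pe in sorted(pe_free_at):
            if pe_free_at[pe] > current_time:
                continue  # this processing element is still busy
            candidates = [p for p in ready if mapping[p] == pe]
            if candidates:
                best = max(candidates, key=lambda p: priority[p])
                schedule[best] = (current_time, current_time + wcet[best])
                pe_free_at[pe] = schedule[best][1]
                ready.remove(best)
        # Advance to the earliest time when a scheduled process terminates.
        pending = [finish for _, finish in schedule.values() if finish > current_time]
        if not pending:
            raise ValueError("no progress possible (cyclic dependencies?)")
        current_time = min(pending)
    return schedule


wcet = {"P1": 2, "P2": 3, "P3": 4, "P4": 1}
mapping = {"P1": "N1", "P2": "N1", "P3": "N2", "P4": "N2"}
preds = {"P3": ["P1"], "P4": ["P2"]}
priority = {"P1": 4, "P2": 1, "P3": 4, "P4": 1}
for process, (start, finish) in sorted(list_schedule(wcet, mapping, preds, priority).items()):
    print(process, start, finish)
```

The priority values are passed in from outside, which mirrors the
role of GetReadyProcess: the quality of the schedule depends on the
priority function, discussed next.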

4.2.2 PCP Priority Function 

Priorities for list scheduling are very often based on the critical
path (CP) from the respective process to the sink node. Thus, for CP
scheduling, the priority assigned to a process P_i will be the
maximal execution time from the current node to the sink:

    l_{P_i} = \max_k \sum_{P_j \in \pi_{ik}} t_{P_j}        (4.1)

where \pi_{ik} is the k-th path from node P_i to the sink node.
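
Equation 4.1 can be evaluated with a short recursive traversal. The
sketch below is one possible way of doing it, assuming an acyclic
graph given as a successor map; the function name and the sample
graph are illustrative assumptions. Each node's value is its own
execution time plus the largest value among its successors.

```python
from functools import lru_cache


def critical_path_priorities(wcet, succs):
    """Longest execution-time path from each node to the sink (Equation 4.1)."""

    @lru_cache(maxsize=None)
    def length(p):
        # Execution time of p plus the longest path among its successors.
        tail = max((length(s) for s in succs.get(p, ())), default=0)
        return wcet[p] + tail

    return {p: length(p) for p in wcet}


wcet = {"A": 2, "B": 3, "C": 4, "D": 1}
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
print(critical_path_priorities(wcet, succs))
```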

Considering the particularities of our problem, significant 
improvements of the resulting schedule can be obtained, without 
any penalty in scheduling time, by making use of the available 
information on process allocation [Ele98b].

Let us consider the graph in Figure 4.2 and suppose that the
list scheduling algorithm has to decide between scheduling
process P_A or P_B, which are both ready to be scheduled on the same
programmable processor or bus pe_i. In Figure 4.2 we depicted only
the critical path from P_A and P_B to the sink node. Let us consider
that P_X is the last successor of P_A on the critical path such that
all processes from P_A to P_X are assigned to the same processing
element pe_i. The same holds for P_Y relative to P_B. Times t_A and
t_B are the total execution times of the chains of processes from
P_A to P_X and from P_B to P_Y, respectively, following the critical
paths. Times λ_A and λ_B are the total execution times of the
processes on the rest of the two critical paths. Thus, considering
Equation 4.1 we have the following critical paths for P_A and P_B,
respectively:

    l_{P_A} = t_A + \lambda_A,    l_{P_B} = t_B + \lambda_B.



However, the algorithm proposed in [Ele98b] does not use the
length of these critical paths as a priority. The policy in [Ele98b]
is based on the estimation of a lower bound L on the total delay,
taking into consideration that the two chains of processes
P_A - P_X and P_B - P_Y are executed on the same processor. L_PA and
L_PB are the lower bounds on the delay if P_A and P_B, respectively,
are scheduled first:

    L_{P_A} = \max(T_{current} + t_A + \lambda_A, T_{current} + t_A + t_B + \lambda_B),

    L_{P_B} = \max(T_{current} + t_B + \lambda_B, T_{current} + t_B + t_A + \lambda_A).

The alternative that offers the perspective of the shorter delay
L = min(L_PA, L_PB) is selected. It can be observed that if λ_A > λ_B
then L_PA < L_PB, which means that we have to schedule P_A first so
that L = L_PA; similarly, if λ_B > λ_A then L_PB < L_PA, and we have to
schedule P_B first in order to get L = L_PB.

Figure 4.2: Delay Estimation for PCP Scheduling
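
The decision rule can be checked numerically with a few lines of
code. The sketch below evaluates the two lower bounds for
hypothetical values (the numbers and function names are assumptions
chosen for illustration) in a case where the plain critical path
would favor P_A, while the lower bound estimation favors P_B.

```python
def pcp_lower_bounds(t_current, t_a, lam_a, t_b, lam_b):
    """Lower bounds on the total delay if P_A, respectively P_B, goes first."""
    l_pa = max(t_current + t_a + lam_a, t_current + t_a + t_b + lam_b)
    l_pb = max(t_current + t_b + lam_b, t_current + t_b + t_a + lam_a)
    return l_pa, l_pb


def pcp_choose(t_current, t_a, lam_a, t_b, lam_b):
    """Select the process that offers the smaller lower bound on the delay."""
    l_pa, l_pb = pcp_lower_bounds(t_current, t_a, lam_a, t_b, lam_b)
    return "P_A" if l_pa <= l_pb else "P_B"


# Hypothetical values: critical paths are 10 + 2 = 12 for P_A and
# 3 + 6 = 9 for P_B, so CP scheduling would pick P_A.
print(pcp_lower_bounds(0, t_a=10, lam_a=2, t_b=3, lam_b=6))
print(pcp_choose(0, t_a=10, lam_a=2, t_b=3, lam_b=6))
```

With these values λ_B > λ_A, so the outcome agrees with the
observation above that only the remaining parts of the critical
paths decide which process is scheduled first.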
4.3 Scheduling for Time-Driven Systems

In the previous sections we were interested in deriving a static
schedule table such that the worst-case delay of an application,
modeled as conditional process graphs, is minimized. In this
section, we propose several extensions to the scheduling algorithm
briefly described in Section 4.2. The extensions consider a
realistic communication and execution infrastructure, and include
aspects of the communication protocol in the optimization
process.

As an input to our problem we consider a safety-critical appli- 
cation modeled as a set of conditional process graphs, see 
Section 2.3.1. The architecture of the system is given as 
described in Section 3.3. Each process of the application is 
mapped on a processor. The worst-case execution time for each 
process is known, as well as the length of each message. 

We are interested in deriving a worst-case delay on the system
execution time that is as small as possible, and in synthesizing
the local schedule tables for each node, as well as the
MEDL for the TTP controllers, which guarantee this delay.

Considering the concrete definition of our problem, which
takes into account the details of the communication protocol, the
communication time is no longer dependent only on the length of
the message, as assumed in the previous section. Thus, if the
message is sent between two processes mapped onto different
nodes, the message has to be scheduled according to the TTP
protocol. Several messages can be packaged together in the data
field of a frame. The number of messages that can be packed
depends on the slot length corresponding to the node. The
effective time spent by a message m_i on the bus is
t_mi = S_Si / s, where S_Si is the length of the slot S_i and s is
the transmission speed of the channel. Therefore, the
communication time t_mi does not depend on the bit length of the
message m_i, but on the slot length corresponding to the node
sending m_i.

Example 4.2: The important impact of the communication
parameters on the performance of the application is
illustrated in Figure 4.3 by means of a simple example. In
Figure 4.3d we have a process graph consisting of four
processes P1 to P4 and four messages m1 to m4. The architecture
consists of two nodes interconnected by a TTP channel. The
first node, N1, transmits in slot S1 of the TDMA round and
the second node, N2, transmits in slot S2. Processes P1
and P4 are mapped on node N1, while processes P2 and P3 are
mapped on node N2.

With the TDMA configuration in Figure 4.3a, where slot
S2 is scheduled first and slot S1 is second, we have a
resulting schedule length of 24 ms. However, if we swap the two
slots inside the TDMA round without changing their lengths,
we can improve the schedule by 2 ms, as seen in Figure 4.3b.

Furthermore, if we have the TDMA configuration in
Figure 4.3c, where slot S1 is first, slot S2 is second and we
increase the slot lengths so that the slots can accommodate
both of the messages generated on the same node, we obtain
a schedule length of 20 ms, which is optimal.

However, increasing the length of slots does not
necessarily improve a schedule, as it delays the communication of
messages generated by other nodes.

■ 

In the next two sections our goal is to synthesize the local 
schedule table of each node and the MEDL of the TTP controller 
for a given order of slots in the TDMA round and given slot 
lengths. The ordering of slots and the optimization of slot 
lengths will be discussed in Section 4.4. 

4.3.1 Scheduling of Messages with the TTP 

Given a certain bus access scheme, which means a given
ordering of the slots in the TDMA round and fixed slot lengths, a CPG
has to be scheduled with the goal of minimizing the worst-case
execution delay. This can be performed using the algorithm
ListScheduling (Figure 4.1) presented in Section 4.2.1. Two aspects
have to be discussed here: the planning of messages in
predetermined slots and the impact of this communication strategy on
the priority assignment.

Figure 4.3: Static Cyclic Scheduling Examples with the TTP

The function ScheduleMessage in Figure 4.4 is called in order
to plan the communication of a message m, with length S_m,
generated on Node_m and which is ready to be transmitted at
TimeReady. The ScheduleMessage function is called immediately
following line 5 in Figure 4.1, considering the processing
element PE as the bus, P_i as the message m (produced with a
corresponding GetReadyMessage), and with TimeReady = CurrentTime.

ScheduleMessage returns the earliest round and the
corresponding slot (the slot corresponding to Node_m) which can host
the message. In Figure 4.4 RoundLength is the length of a TDMA
round expressed in time units (in Figure 4.5, for example,
RoundLength = 18 ms). The first round after TimeReady is the
initial candidate to be considered (line 4). For this round, however,
it can be too late to catch the corresponding slot, in which case
the next round is selected, lines 5-9. When a candidate round is
selected we have to check, in line 11, that there is enough space
left in the slot for our message (S_occupied represents the total
number of bits occupied by messages already scheduled in the
respective slot of that round). If no space is left, the
communication has to be delayed for another round (line 13).

ScheduleMessage(TimeReady, S_m, Node_m)
 1  -- the slot in which the message has to be sent
 2  Slot = the slot assigned to Node_m
 3  -- the first round which could be a candidate
 4  Round = ⌈TimeReady / RoundLength⌉
 5  -- is the right slot in this round already gone?
 6  if TimeReady - Round * RoundLength > start_Slot then
 7    -- if yes, take the next round
 8    Round = Round + 1
 9  end if
10  -- is enough space left in the slot for the message?
11  while S_m > S_Slot - S_occupied do
12    -- if not, take the next round
13    Round = Round + 1
14  end while
15  -- return the right round and slot
16  return (Round, Slot)
end ScheduleMessage

Figure 4.4: The ScheduleMessage Function
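
A runnable sketch of this procedure is given below. It follows the
steps of Figure 4.4 under a few assumptions that are not part of the
figure: rounds are numbered from 0, so the first candidate round is
obtained by integer division; the slot is described by its start
offset within the round and its size in bits; the occupied bits are
kept per round in a dictionary; and a message larger than the slot
is rejected so that the loop always terminates. All names and
numbers are illustrative.

```python
def schedule_message(time_ready, msg_bits, slot, round_length, occupied):
    """Return (round, slot start time) of the earliest slot hosting the message.

    slot:     {"start": offset of the slot inside a round, "size": bits}
    occupied: round -> bits already used in this slot (updated in place)
    """
    if msg_bits > slot["size"]:
        raise ValueError("message does not fit into the slot at all")
    # The first round which could be a candidate (rounds numbered from 0).
    rnd = time_ready // round_length
    # Is the right slot in this round already gone?
    if time_ready - rnd * round_length > slot["start"]:
        rnd += 1
    # Is enough space left in the slot for the message?
    while msg_bits > slot["size"] - occupied.get(rnd, 0):
        rnd += 1
    occupied[rnd] = occupied.get(rnd, 0) + msg_bits
    return rnd, rnd * round_length + slot["start"]


# Hypothetical configuration: 20 ms rounds, slot starting 12 ms into the
# round and holding 32 bits.
slot = {"start": 12, "size": 32}
occupied = {}
print(schedule_message(5, 16, slot, 20, occupied))    # fits in the first round
print(schedule_message(9, 16, slot, 20, occupied))    # fills the slot
print(schedule_message(10, 8, slot, 20, occupied))    # no space left, next round
print(schedule_message(13, 16, slot, 20, occupied))   # slot already gone
```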

With this message scheduling scheme, the algorithm in 
Figure 4.1 will generate correct schedules for a TTP based archi- 
tecture, with guaranteed worst-case execution delays. However, 
the quality of the schedules can be much improved by adapting 
the priority assignment scheme so that particularities of the 
communication protocol are taken into consideration. 

4.3.2 Improved Priority Function 

For the scheduling algorithm outlined previously we initially
used the Partial Critical Path (PCP) priority function presented
in Section 4.2.2. As discussed before, PCP uses as a priority
criterion the length of that part of the critical path corresponding
to a process P_i which starts with the first successor of P_i that is
assigned to a processor different from the processor running P_i.
The PCP priority function is statically computed once at the
beginning of the scheduling procedure.

However, considering the concrete definition of our problem, 
significant improvements of the resulting schedule can be 
obtained by including knowledge of the bus access scheme into 
the priority function. This new priority function will be used by
the GetReadyProcess function (Figure 4.1) in order to decide which
process to select from the list of ready processes.

Example 4.3: Let us consider the graph in Figure 4.5c, and
suppose that the list scheduling algorithm has to decide
whether to schedule process P1 or P2, which are both ready to
be scheduled on the same programmable processor. The
worst-case execution time of the processes is depicted on the
right side of the respective node and is expressed in ms. The
architecture consists of two nodes interconnected by a TTP
channel. Processes P1 and P2 are mapped on node N1, while
processes P3 and P4 are mapped on node N2. Node N1
transmits in slot S1 of the TDMA round and N2 transmits in slot S2.
Slot S1 has a length of 10 ms while slot S2 has a length of 8
ms. For simplicity we suppose that there is no message
transferred between P1 and P3. The PCP (see Section 4.2.2)
assigns a higher priority to P1 because it has a partial
critical path of 12, starting from P3, longer than the partial
critical path of P2 which is 10 and starts from m. This results in a
schedule length of 40 ms as depicted in Figure 4.5a. On the
other hand, if we schedule P2 first, the resulting schedule,
depicted in Figure 4.5b, is only 36 ms long.

This apparent anomaly is due to the fact that the way we
have computed PCP priorities, considering message
communication as a simple activity of delay 6 ms, is not realistic in
the context of a TDMA protocol. Let us consider the particular
TDMA configuration in Figure 4.5 and suppose that the
scheduler has to decide at t = 0 which one of the processes P1
or P2 to schedule. If P2 is scheduled, the message is ready to
be transmitted at t' = 8. Based on a computation similar to
that used in Figure 4.4, it follows that message m will be
placed in round ⌈8/18⌉¹ = 1, and it arrives in time to get slot
S1 of that round (TimeReady = 8 < start_S1 = 10). Thus, m
arrives at t_arr = 18, which means a delay relative to t' = 8
(when the message was ready) of δ = 10. This is the delay
that should be considered for computing the partial critical
path of P2, which now results in δ + t_P4 = 14 (longer than the
one corresponding to P1).



1. The operator ⌈x⌉ is the ceiling operator, which returns the smallest
integer greater than or equal to x.

The obvious conclusion is that priority estimation has to be
based on message planning with the TDMA scheme. Such an
estimation, however, cannot be performed statically, before
scheduling. If we take the same example in Figure 4.5, but consider that
the priority based decision is taken by the scheduler at t = 5, m
will be ready at t' = 13. This is too late for m to get into slot S1 of
Round 1. The message arrives with Round 2 at t_arr = 36. This
leads to a delay due to the message passing of δ = 36 - 13 = 23,
different from the one computed above.

We introduce, therefore, a new priority function, the Modified
PCP (MPCP), which is computed during scheduling, whenever
several processes are in competition to be scheduled on the same
resource. Similar to PCP, the priority metric is the length of that
portion of the critical path corresponding to a process P_i which
starts with the first successor of P_i that is assigned to a
processor different from M(P_i). The critical path estimation starts
with the time t at which the processes in competition are ready to be
scheduled on the available resource. During the partial
traversal of the graph the delay introduced by a certain node P_j is
estimated as follows:

    \delta_{P_j} = \begin{cases}
        t_{P_j},      & \text{if } P_j \text{ is not a message passing} \\
        t_{arr} - t', & \text{if } P_j \text{ is a message passing}
    \end{cases}

The term t' is the time when the node generating the message
terminates (and the message is ready); t_arr is the time when the
slot to which the message is supposed to be assigned has
arrived. The slot is determined like in Figure 4.4, but without
taking into consideration space limitations in slots.

Thus, the priority function MPCP has to be dynamically
determined during the scheduling algorithm for each ready process,
every time the GetReadyProcess function is activated in order to
select a process from the ReadyList. The computation of λ, used in
MPCP similarly to the PCP case (see Section 4.2.2), is performed
inside the GetReadyProcess function and involves a partial
traversal of the graph, as presented in Figure 4.6.

As the experimental results (Section 4.5) show, using MPCP
instead of PCP for the TTP based architecture results in an
important improvement of the quality of generated schedules, with
only a slight increase in scheduling time.
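
The traversal can be sketched in a few lines. The version below is
an illustration under stated assumptions rather than the function
of Figure 4.6 itself: it returns the estimated end time of the
longest path instead of updating a global MaxLambda, it ignores
space limitations in slots, it numbers rounds from 0, and it uses a
hypothetical graph and TDMA configuration. Each slot is given as a
pair (start offset in the round, slot length), and a message is
taken to arrive at the end of its slot.

```python
def lambda_end(t, node, graph, round_length, slots):
    """Estimated end time of the longest path that starts at `node` at time t."""
    info = graph[node]
    if info["kind"] == "message":
        start, length = slots[info["sender"]]
        rnd = t // round_length
        if t - rnd * round_length > start:
            rnd += 1  # the slot is already gone in this round
        t = rnd * round_length + start + length  # arrival at the end of the slot
    else:
        t = t + info["wcet"]
    # Continue along all successors and keep the longest continuation.
    return max((lambda_end(t, s, graph, round_length, slots) for s in info["succs"]),
               default=t)


# Hypothetical chain P -> m -> Q; the sender's slot starts 12 ms into a
# 20 ms round and is 8 ms long.
graph = {
    "P": {"kind": "process", "wcet": 8, "succs": ["m"]},
    "m": {"kind": "message", "sender": "N1", "succs": ["Q"]},
    "Q": {"kind": "process", "wcet": 4, "succs": []},
}
slots = {"N1": (12, 8)}
print(lambda_end(0, "P", graph, 20, slots))   # message catches the slot in the first round
print(lambda_end(5, "P", graph, 20, slots))   # message has to wait for the next round
```

The two calls differ only in the time at which the decision is
taken, which is why such an estimate cannot be computed once before
scheduling.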



4.4 Bus Access Optimization 

In the previous sections we have shown how the algorithm
ListScheduling can produce an efficient schedule for a CPG, given a
certain TDMA bus access scheme. However, as shown in
Figure 4.3, both the ordering of slots and the slot
lengths strongly influence the worst-case execution delay of the
system.

Lambda(lambda, CurrentProcess)
 1  if CurrentProcess is a message then
 2    slot = slot of node sending CurrentProcess
 3    round = lambda / RoundLength
 4    if lambda - RoundLength * round > start of slot in round then
 5      round = next round
 6    end if
 7    while not message fits in the slot of round do
 8      round = next round
 9    end while
10    lambda = round * RoundLength + start of slot in round + length of slot
11  else
12    lambda = lambda + WCET of CurrentProcess
13  end if
14  if lambda > MaxLambda then
15    MaxLambda = lambda
16  end if
17  for each successor of CurrentProcess do
18    Lambda(lambda, successor)
19  end for
20  return MaxLambda
end Lambda

Figure 4.6: The Lambda Function

In this section, we first present a heuristic which, based on a 
greedy approach, determines an ordering of slots and their 
lengths so that the worst-case delay corresponding to a certain 
CPG is as small as possible. Then, we present an algorithm based 
on a simulated annealing strategy, which finds that bus configu- 
ration which leads to the near-optimal delay for a CPG. 

4.4.1 Greedy Approaches 

Figure 4.7 presents a greedy heuristic that starts with
determining an initial solution, the so-called “straightforward” one,
which assigns the nodes to the slots in order (Node_Si = N_i) and
fixes the slot length length_Si to the minimal allowed value, which
is equal to the length of the largest message generated by a process
assigned to Node_Si (lines 1-5).

The next step of the algorithm starts with the first slot and
tries to find the node which, when transmitting in this slot, will
minimize the worst-case delay of the system, as produced by
ListScheduling. Simultaneously with searching for the right node
to be assigned to the slot, the algorithm looks for the optimal slot
length (lines 12-18). Once a node has been selected for the first
slot and a slot length fixed (line 23), the algorithm continues with
the next slots, trying to assign nodes (and to fix slot lengths)
from those nodes which have not yet been assigned.
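
The structure of this greedy search can be sketched as follows. The
sketch is not the algorithm of Figure 4.7 itself: the scheduler is
replaced by a caller-supplied function evaluate, which stands in for
ListScheduling and returns the worst-case delay of a configuration,
and the slot lengths to try are supplied by a second function
candidate_lengths, assumed to return at least one value. These two
parameters and the toy cost function in the usage example are
assumptions made for illustration.

```python
def optimize_access(nodes, min_length, candidate_lengths, evaluate):
    """Greedy choice of the node and slot length for each TDMA slot in turn.

    evaluate(order, lengths) -> worst-case delay (stand-in for ListScheduling)
    candidate_lengths(node)  -> non-empty iterable of slot lengths to try
    """
    order = list(nodes)                           # straightforward solution
    lengths = {n: min_length[n] for n in nodes}   # minimal allowed lengths
    for i in range(len(order)):
        best = None                               # (delay, position j, length)
        for j in range(i, len(order)):            # nodes not yet allocated
            order[i], order[j] = order[j], order[i]
            node, saved = order[i], lengths[order[i]]
            for length in candidate_lengths(node):
                lengths[node] = length
                delay = evaluate(order, lengths)
                if best is None or delay < best[0]:
                    best = (delay, j, length)
            lengths[node] = saved                 # restore the previous state
            order[i], order[j] = order[j], order[i]
        _, j, length = best
        order[i], order[j] = order[j], order[i]   # slot i gets its node
        lengths[order[i]] = length                # and its length fixed
    return order, lengths


# Toy cost: nodes with a larger weight should transmit early, and longer
# slots make the round longer.
weight = {"N1": 1, "N2": 3}
min_length = {"N1": 4, "N2": 6}


def toy_delay(order, lengths):
    return sum((k + 1) * weight[n] for k, n in enumerate(order)) + sum(lengths.values())


print(optimize_access(["N1", "N2"], min_length,
                      lambda n: [min_length[n], min_length[n] + 2], toy_delay))
```

Keeping the evaluation behind a function parameter makes the cost of
the heuristic visible: every candidate node and slot length triggers
one complete scheduling run.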

When calculating the length of a certain slot, a first alternative
is to try all the slot lengths allowed by the protocol. Such an
approach starts with the minimum slot length, determined by the
largest message to be sent from the candidate node, and continues
incrementing by the smallest data unit (e.g., 2 bits) up to the
largest slot length, determined by the maximum allowed data field
in a TTP frame (e.g., 32 bits, depending on the controller
implementation). We call this alternative OptimizeAccess1.

A second alternative, OptimizeAccess2, is based on feedback from
the scheduling algorithm, which recommends slot sizes to be tried
out. Before starting the actual optimization process for the bus
access scheme, a scheduling of the straightforward solution
(determined in lines 1-5) is performed, which generates the
recommended slot lengths. These lengths are produced by the
ScheduleMessage function (Figure 4.4) whenever a new round has to
be selected because of lack of space in the current slot. In such
a case, the slot length which would be needed in order to
accommodate the new message is added to the list of recommended
lengths for the respective slot. With this alternative, the
optimization algorithm in Figure 4.7 only selects among the
recommended lengths when searching for the right dimension of a
certain slot (line 13).

OptimizeAccess
 1  -- creates the initial, straightforward solution
 2  for i = 1 to NrSlot do
 3    Node_Si = N_i
 4    length_Si = MinLength_Si
 5  end for
 6  -- over all slots
 7  for i = 1 to NrSlot do
 8    -- over all slots which have not yet been allocated
 9    -- a node and slot length
10    for j = i to NrSlot do
11      swap values (Node_Si, length_Si) with (Node_Sj, length_Sj)
12      -- initially, length_Si has the minimal allowed value
13      for all slot lengths λ larger than length_Si do
14        length_Si = λ
15        ListScheduling( ... )
16        remember BestSolution = (Node_Si, length_Si),
17          with the smallest δ_max produced by ListScheduling
18      end for
19      swap back values (Node_Si, length_Si) with (Node_Sj, length_Sj)
20        to the state before entering the for cycle
21    end for
22    -- slot S_i gets a node allocated and a length fixed
23    Bind (Node_Si, length_Si) = BestSolution
24  end for
end OptimizeAccess

Figure 4.7: Optimization of the Bus Access Scheme
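
The greedy procedure of Figure 4.7 can be sketched in Python as follows. This is only an illustration under stated assumptions: `optimize_access`, `candidate_lengths`, `schedule_delay` and the toy cost function `toy_delay` are hypothetical names that do not appear in the original text, and `toy_delay` merely stands in for ListScheduling. The callback `candidate_lengths` is assumed to return at least one length per node (all allowed lengths for OptimizeAccess1, only the recommended ones for OptimizeAccess2).

```python
from typing import Callable, Dict, Iterable, List, Tuple

Slot = Tuple[str, int]  # (node transmitting in the slot, slot length)


def optimize_access(
    nodes: List[str],
    min_length: Dict[str, int],
    candidate_lengths: Callable[[str], Iterable[int]],
    schedule_delay: Callable[[List[Slot]], float],
) -> List[Slot]:
    """Greedy ordering and sizing of TDMA slots (sketch of Figure 4.7)."""
    # Lines 1-5: straightforward solution, minimal length for every slot.
    slots: List[Slot] = [(n, min_length[n]) for n in nodes]

    for i in range(len(slots)):                      # line 7
        best = None                                  # (delay, j, length)
        for j in range(i, len(slots)):               # line 10
            slots[i], slots[j] = slots[j], slots[i]  # line 11
            node, original_length = slots[i]
            for length in candidate_lengths(node):   # line 13
                slots[i] = (node, length)
                delay = schedule_delay(slots)        # line 15
                if best is None or delay < best[0]:
                    best = (delay, j, length)        # lines 16-17
            slots[i] = (node, original_length)
            slots[i], slots[j] = slots[j], slots[i]  # line 19
        # Line 23: bind the best node and length found to slot i.
        _, j, length = best
        slots[i], slots[j] = slots[j], slots[i]
        slots[i] = (slots[i][0], length)
    return slots


# Toy stand-in for ListScheduling (an assumption): each node has an urgency
# weight and the "delay" is the weighted sum of the slot end times.
URGENCY = {"N1": 1, "N2": 5, "N3": 2}


def toy_delay(slots: List[Slot]) -> float:
    time, total = 0, 0
    for node, length in slots:
        time += length
        total += URGENCY[node] * time
    return total


result = optimize_access(
    ["N1", "N2", "N3"],
    {"N1": 2, "N2": 2, "N3": 2},
    lambda node: [2],  # only the minimal length is tried in this toy run
    toy_delay,
)
print(result)
```

With the toy cost, the most urgent node is moved to the first slot. In a real setting every call to `schedule_delay` is a full scheduling run, which is why restricting the candidate lengths (OptimizeAccess2) reduces the optimization time.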

4.4.2 Simulated Annealing 

The second algorithm we have developed is based on a simulated 
annealing (SA) strategy, described in detail in Appendix A. 

The greedy strategy constructs the solution by progressively
selecting the best candidate in terms of the schedule length
produced by the function ListScheduling. Unlike the greedy
strategy, SA will try to escape from a local optimum by randomly
choosing a neighboring solution (see Figure A.1 on page 286 in
Appendix A).

The neighbors of the current solution are obtained by a
permutation of the slots in the TDMA round and/or by increasing
or decreasing the slot lengths. We generate the new solution by
randomly swapping two slots (with probability 0.3) and/or by
increasing or decreasing, by the smallest data unit, the length
of a randomly selected slot (with probability 0.7). These
probabilities have been determined experimentally.
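
A neighbor move of this kind might look as follows. This is a minimal sketch, assuming that exactly one of the two moves is applied per step (the text allows "and/or") and that slot lengths are clamped to the allowed range; the function name `neighbor` and its parameters are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Slot = Tuple[str, int]  # (node, slot length)


def neighbor(
    slots: List[Slot],
    min_length: Dict[str, int],
    max_length: int,
    unit: int,
    rng: random.Random,
) -> List[Slot]:
    """Return a neighboring bus configuration of `slots`."""
    new = list(slots)
    if len(new) >= 2 and rng.random() < 0.3:
        # Swap two randomly chosen slots (probability 0.3).
        i, j = rng.sample(range(len(new)), 2)
        new[i], new[j] = new[j], new[i]
    else:
        # Grow or shrink one random slot by the smallest data unit.
        k = rng.randrange(len(new))
        node, length = new[k]
        length += rng.choice((-unit, unit))
        # Clamp to the allowed range; a clamped move may leave the
        # configuration unchanged.
        length = min(max_length, max(min_length[node], length))
        new[k] = (node, length)
    return new
```

Passing the random generator explicitly keeps runs reproducible, which is convenient when tuning the move probabilities experimentally.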

For graphs with 160 or fewer processes we were able to run an
exhaustive search that found the optimal solutions. For the rest
of the graph dimensions, we performed very long and expensive
runs with the SA algorithm, and the best solution ever produced
has been considered as the optimum for the further experiments.
Based on further experiments we have determined the parameters of
the SA algorithm so that the optimization time is reduced as much
as possible but the optimal result is still produced (see
Appendix A for the details on these parameters). For example, for
the graphs with 320 nodes, the initial temperature TI is 500, the
temperature length parameter TL is 400 and the cooling ratio e is
0.97. The algorithm stops if for three consecutive temperatures
no new solution has been accepted.
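
The role of these parameters can be illustrated with a generic annealing loop. This is a sketch, not the algorithm of Appendix A: it assumes the standard Metropolis acceptance rule and a geometric cooling schedule, and the names `simulated_annealing`, `cost` and `neighbor` are hypothetical.

```python
import math
import random


def simulated_annealing(initial, cost, neighbor, rng,
                        ti=500.0, tl=400, cooling=0.97, idle_limit=3):
    """Generic SA loop; returns the best solution seen and its cost."""
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    temperature = ti
    idle = 0  # consecutive temperatures without any accepted move
    while idle < idle_limit:
        accepted = False
        for _ in range(tl):  # TL trials at each temperature
            candidate = neighbor(current, rng)
            delta = cost(candidate) - current_cost
            # Assumed Metropolis rule: always accept improvements, accept
            # a deterioration with probability exp(-delta / temperature).
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, current_cost + delta
                accepted = True
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        idle = 0 if accepted else idle + 1
        temperature *= cooling  # geometric cooling
    return best, best_cost
```

As a toy usage, minimizing (x - 7)^2 over the integers with moves of +1 or -1 terminates once the temperature is so low that uphill moves are no longer accepted for three temperatures in a row.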
4.5 Experimental Evaluation 

For the evaluation of our scheduling algorithms we first used
conditional process graphs generated for experimental purposes.
We considered architectures consisting of 2, 4, 6, 8 or 10 nodes.
Forty processes were assigned to each node, resulting in
applications of 80, 160, 240, 320 or 400 processes. Thirty
applications were generated for each dimension, thus a total of
150 applications were used for the experimental evaluation.
Execution times and message lengths were assigned randomly using
both uniform and exponential distributions. For the communication
channel we considered a transmission speed of 256 Kbps and a
length below 20 meters. The maximum length of the data field was
8 bytes, and the frequency of the TTP controller was chosen to be
20 MHz. All experiments were run on a SPARCstation 20.

4.5.1 Priorities for the TTP Scheduling 

The first result concerns the quality of the schedules produced
by the list scheduling based algorithm using the PCP and the MPCP
priority functions. In order to compare the two priority
functions, we have calculated the average percentage deviations
of the schedule length produced with PCP and MPCP from the length
of the best schedule between the two. The results are depicted in
Figure 4.8a. On average, the deviation with MPCP is 11.34 times
smaller than with PCP. However, due to its dynamic nature, MPCP
has on average a longer execution time than PCP. The average
execution times for the ListScheduling function using PCP and
MPCP are depicted in Figure 4.8b and are under half a second for
graphs with 400 processes.
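
The deviation measure used in this comparison can be written down directly. The sketch below uses hypothetical function names and schedule lengths; it only illustrates how the percentage deviation from the better of the two schedules is computed and averaged.

```python
from typing import List, Tuple


def percentage_deviations(pcp_len: float, mpcp_len: float) -> Tuple[float, float]:
    """Deviation (in %) of each schedule length from the better of the two."""
    best = min(pcp_len, mpcp_len)
    return (100.0 * (pcp_len - best) / best,
            100.0 * (mpcp_len - best) / best)


def average_deviations(lengths: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Average the per-application deviations for PCP and MPCP."""
    devs = [percentage_deviations(p, m) for p, m in lengths]
    n = len(devs)
    return (sum(d[0] for d in devs) / n, sum(d[1] for d in devs) / n)


# Hypothetical schedule lengths (PCP, MPCP) for two applications.
print(average_deviations([(110, 100), (100, 100)]))
```

By construction, at least one of the two deviations is zero for every application, so the averages show how often and by how much each priority function loses to the other.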

4.5.2 Bus Access Optimization Heuristics 

In the next experiments we were interested in checking the
potential of the algorithms presented in Section 4.4 to improve
the generated schedules by optimizing the bus access scheme. We
compared schedule lengths, obtained for the 150 applications
described above, considering four different bus access schemes:
the straightforward solution, the optimized schemes generated
with the two alternatives of our greedy algorithm
(OptimizeAccess1 and OptimizeAccess2) and the near-optimal scheme
produced using the simulated annealing (SA) based algorithm. Very
long and extensive runs have been performed with the SA algorithm
for each application and the best ever solution produced has been
considered as the near-optimum for that case.

Figure 4.8: Comparison of the Two Priority Functions
a) Quality of schedules with PCP and MPCP
b) Average execution time of PCP and MPCP

Table 4.2 presents the average and maximum percentage 
deviation of the schedule lengths obtained with the straightfor- 
ward solution and with the two optimized schemes from the 
length obtained with the near-optimal scheme. For each of the 
application dimensions, the average optimization time, 
expressed in seconds, is also given. 

The first conclusion is that by considering the optimization of
the bus access scheme, the results improve significantly compared
to the straightforward solution. The greedy heuristic performs
well for all the graph dimensions. As expected, the alternative
OptimizeAccess1 (which considers all allowed slot lengths)
produces slightly better results, on average, than
OptimizeAccess2. However, the execution times are much smaller
for OptimizeAccess2. It is interesting to mention that the average
execution times for the SA algorithm, needed to find the
near-optimal solutions, are between 5 minutes for the applications
with 80 processes and 275 minutes for 400 processes.

Table 4.2: Evaluation of the Bus Access Optimization Algorithms

No. of   Straightforward    OptimizeAccess1               OptimizeAccess2
proc.    solution
         avg.     max.      avg.     max.     exec.       avg.     max.     exec.
         dev.     dev.      dev.     dev.     time        dev.     dev.     time
80       3.16%    21%       0.02%    0.5%     0.25 s      1.8%     19.7%    0.04 s
160      14.4%    53.4%     2.5%     9.5%     2.07 s      4.9%     26.3%    0.28 s
240      37.6%    110%      7.4%     24.8%    10.46 s     9.3%     31.4%    1.34 s
320      51.5%    135%      8.5%     31.9%    34.69 s     12.1%    37.1%    4.8 s
400      48%      135%      10.5%    32.9%    56.04 s     11.8%    31.6%    8.2 s

4.5.3 The Vehicle Cruise Controller 

Finally, we have evaluated our approaches using the cruise con- 
troller case study presented in Section 2.3.3. For the implemen- 
tation of the cruise controller as a time-driven system we have 
considered: 

• the hardware architecture from Figure 2.7a on page 37, con- 
sisting of five nodes interconnected by a TTP bus, 

• the software architecture for time-triggered systems, out- 
lined in Section 3.3, 

• the mapped model presented in Figure 2.9 on page 40, hav- 
ing 32 processes and two conditions, 

• and a deadline of 400 ms. 

Thus, for the cruise controller example, the straightforward 
solution for bus access resulted in a schedule corresponding to a 
maximal delay of 429 ms (which does not meet the deadline) 
when PCP was used as a priority function, while using MPCP we 
obtained a schedule length of 398 ms. The first and second 
greedy heuristics for bus access optimization produced solutions 
that reduced the worst-case delay to 314 and 323 ms, respec- 
tively. The near-optimal solution (produced with the SA based 
approach) results in a delay of 302 ms. The greedy heuristics 
and the SA have used MPCP as the priority function for list sched- 
uling. 

This shows that the quality of generated schedules can be 
improved by considering the exact details of the communication 
protocol, and by optimizing the bus access scheme. 
Using as a basis the timing analysis and communication syn- 
thesis developed in this chapter, in the next chapter we will 
address the mapping design task within an incremental design 
environment. 
In this chapter we present an approach to mapping and 
scheduling for time-driven systems where processes are scheduled 
according to a non-preemptive static cyclic scheduling scheme, 
and communication uses a time division multiple access (TDMA) 
protocol. We accurately take into consideration the communica- 
tion costs and consider, during the mapping and scheduling pro- 
cess, the particular requirements of the communication protocol. 

The mapping and scheduling tasks are considered in the con- 
text of an incremental design process as outlined in Section 2.2. 
This implies that we perform mapping and scheduling of new 
functionality on a given distributed embedded system, so that 
certain design constraints are satisfied and, in addition: 

1. The already running applications are disturbed as little as 
possible. 

2. There is a good chance that, later, new functionality can
easily be mapped on the resulting system.

We propose a new heuristic, together with the corresponding
design criteria, which finds the set of old applications that have
to be re-mapped and rescheduled together with the mapping and
scheduling of the new application, such that the disturbance on
the running system (expressed as the total cost implied by the
modifications) is minimized. Once this set of applications has
been determined, mapping and scheduling are performed according
to the requirements stated above.

Supporting such a design process is of critical importance for
current and future industrial practice, as the time interval
between successive generations of a product is continuously
decreasing, while the complexity due to increased sophistication
of new functionality is growing rapidly. The goal of reducing the
overall cost of successive product generations has been one of
the main drivers behind the currently very popular concept of
platform-based design (see Section 2.1.4).

Addressing mapping and scheduling inside an incremental 
design process is not limited to time-driven systems. In Chapter 
7 we investigate the issues arising from considering incremental 
mapping and scheduling in the context of event-driven systems, 
where processes are scheduled according to a fixed-priority pre- 
emptive scheme, while messages are sent using the TTP. 

For the sake of simplifying the discussion, we will not address 
here, nor in Chapter 7, the memory constraints during process 
mapping and the implications of memory space in the incremen- 
tal design process. 

This chapter is organized as follows. The next section presents
some issues related to mapping and scheduling in the context of
a system based on a TDMA communication protocol. In Section 5.2
the problem we are going to solve is formulated, and
Section 5.2.1 introduces our approach to quantitatively
characterizing certain features of future applications. In
Section 5.3 we introduce the metrics we have defined in order to
capture the quality of a given design alternative and, based on
these metrics, we give an exact problem formulation. Our mapping
and scheduling strategy is described in Section 5.4 and the
experimental results are presented in Section 5.5.
5.1 Background 

In order to implement an application represented as a set of
conditional process graphs as described in Section 2.3, the designer
has to map the processes to the system nodes and to derive a 
static cyclic schedule such that all deadlines are satisfied. We 
first illustrate some of the problems related to mapping and 
scheduling in the context of a system based on a TDMA communi- 
cation protocol, before going on to explore further aspects spe- 
cific to an incremental design approach. 

Example 5.1: Let us consider the example in Figure 5.1,
where we want to map an application consisting of four
processes P1 to P4, with a period and deadline of 50 ms. The
architecture is composed of three nodes that communicate
according to a TDMA protocol, such that node N_i transmits in
slot S_i. For this example we suppose that there is no other
previous application running on the system.

According to the specification, processes P1 and P3 are
constrained to node N1, while P2 and P4 can be mapped on nodes
N2 or N3, but not N1. The worst-case execution times of the
processes on each potential node and the sequence and size of
the TDMA slots are presented in Figure 5.1. In order to keep the
example simple, we suppose that the message sizes are such
that each message fits into one TDMA slot.

We consider two alternative mappings. If we map P2 and P4
on the faster processor N3, the resulting schedule length
(Figure 5.1a) will be 52 ms, which does not meet the deadline.
However, if we map P2 and P4 on the slower processor N2, the
schedule length (Figure 5.1b) is 48 ms, which meets the
deadline. Note that the total traffic on the bus is the same
for both mappings and the initial processor load is 0 on both
N2 and N3.

This result is explained by the impact of the communication
protocol. P3 cannot start before receiving messages m2,3 and
m4,3. However, slot S2 corresponding to node N2 precedes in the
TDMA round slot S3, in which node N3 communicates. Thus, the
messages which P3 needs are available sooner in the case P2 and
P4 are, counter-intuitively, mapped on the slower node.

■ 
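
The effect of the slot position can be made concrete with a small calculation. The sketch below is an illustration with hypothetical slot lengths (Figure 5.1 is not reproduced here); it assumes that a message can be sent in the first occurrence of the sender's slot that starts at or after the moment the message is ready, and that it becomes available to the receiver at the end of that slot. The function name `message_arrival` is an assumption.

```python
from typing import List


def message_arrival(ready_time: int, sender_slot: int,
                    slot_lengths: List[int]) -> int:
    """Time at which a message sent in `sender_slot` is available."""
    round_length = sum(slot_lengths)
    offset = sum(slot_lengths[:sender_slot])  # slot start inside a round
    rnd = 0
    # Advance round by round until the slot starts late enough.
    while rnd * round_length + offset < ready_time:
        rnd += 1
    return rnd * round_length + offset + slot_lengths[sender_slot]


# Hypothetical round of three slots S1, S2, S3 of 4 ms each.
slots = [4, 4, 4]
print(message_arrival(3, 1, slots))  # sender on N2, transmitting in S2
print(message_arrival(3, 2, slots))  # sender on N3, transmitting in S3
```

For the same ready time, the message sent from N2 arrives one slot earlier than the one sent from N3, which is the mechanism behind the counter-intuitive result of Example 5.1.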

But finding a valid schedule is not enough if we are to support 
an incremental design process as discussed in the introduction. 
In this case, starting from a valid design, we have to improve the 
mapping and scheduling so that not only the design constraints 
are satisfied, but also there is a good chance that, later, new 
functionality can easily be mapped on the resulting system. 

Example 5.2: To illustrate the role of mapping and scheduling
in the context of an incremental design process, let us
consider the example in Figure 5.2. For simplicity, we consider
an architecture consisting of a single processor. The system
is currently running application ψ (Figure 5.2a).

At a particular moment application Γ1 has to be implemented
on top of ψ. Three possible implementation alternatives for Γ1
are depicted in Figure 5.2b1, 5.2c1, and 5.2d1. All three meet
the imposed time constraint for Γ1.

At a later moment, application Γ2 has to be implemented on
the system running ψ plus Γ1. If Γ1 has been implemented as
shown in Figure 5.2b1, there is no possibility to map
application Γ2 on the given system (in particular, there is no
time slack available for process P7). If Γ1 has been implemented
as in Figure 5.2c1 or 5.2d1, Γ2 can be correctly mapped and
scheduled on top of ψ and Γ1.

■ 

There are two aspects which should be highlighted based on 
this example: 

1. If application Γ1 is implemented as in Figure 5.2c1 or 5.2d1,
it is possible to implement Γ2 on top of the existing system,
without performing any modifications on the implementation of
previous applications. This could be the case if, during the
implementation of Γ1, the designers have taken into
consideration the fact that, in the future, an application
having the characteristics of Γ2 will possibly be added to the
system.

2. If Γ1 has been implemented as in Figure 5.2b1, Γ2 can be
added to the system only after performing certain modifications
on the implementation of Γ1 and/or ψ. In this case, of course,
it is important to perform as few modifications as possible on
previous applications, in order to reduce the development costs.

[Figure 5.2: schedule diagrams not reproduced; time axis up to 90 ms.
Application ψ: two processes of 10 ms each; T_ψ = D_ψ = 90 ms.
Application Γ1: three processes of 10 ms each; T_Γ1 = 90 ms, D_Γ1 = 80 ms.
Application Γ2: P6 of 10 ms and P7 of 30 ms; T_Γ2 = D_Γ2 = 90 ms.
a) Initial system, running application ψ
b1) Application Γ1 on top of ψ: 1st alternative
b2) Application Γ2 on top of the 1st alternative: P7 cannot be mapped
c1) Application Γ1 on top of ψ: 2nd alternative
c2) Application Γ2 on top of the 2nd alternative: successful implementation
d1) Application Γ1 on top of ψ: 3rd alternative
d2) Application Γ2 on top of the 3rd alternative: successful implementation]

5.2 Incremental Mapping and Scheduling 

Our goal is to map and schedule an application Γ_current on a
system that already implements a set ψ of applications,
considering the following requirements:

Requirement a   All constraints on Γ_current are satisfied and
                minimal modifications are performed to the
                applications in ψ.

Requirement b   New applications Γ_future can be mapped on top of
                the resulting system.

In order to achieve our goal we need certain information to be
available concerning the set of applications ψ as well as the
possible future applications Γ_future. What exactly we have to
know about existing applications has been outlined in
Section 2.3.2, while the characterization of future applications
will be discussed in the next section. In Section 5.3 we then
introduce the quality metrics which will allow us to give a more
rigorous formulation of the problem we are going to solve.

The processes in application Γ_current can interact with the
previously mapped applications ψ by reading messages generated
on the bus by processes in ψ. In this case, the reading process
has to be synchronized with the arrival of the message on the
bus, which is easy to model as an additional time constraint on
the particular receiving process. This constraint is then
considered (as any other deadline) during scheduling of
Γ_current.

5.2.1 Characterizing Future Applications 

What do we assume to know about the family Γ_future of
applications which do not exist yet? Given a certain limited
application area (e.g., automotive electronics), it is not
unreasonable to assume that, based on the designers’ previous
experience, the nature of expected future functions to be
implemented, profiling of previous applications, available
incomplete designs for future versions of the product, etc., it
is possible to characterize the family of applications which
possibly could be added to the current implementation. This
assumption is fundamental to the concept of incremental design.

Hence, we consider that, with respect to the future applications,
we know the set S_t = {t_min, ..., t_i, ..., t_max} of possible
worst-case execution times for processes, and the set
S_b = {b_min, ..., b_i, ..., b_max} of possible message sizes. We
also assume that over these sets we know the distributions of
probability f_St(t) for t ∈ S_t and f_Sb(b) for b ∈ S_b.

Example 5.3: For example, we might have predicted the possible
worst-case execution times of different processes in future
applications as S_t = {50, 100, 200, 300, 500 ms}. If there is
a higher probability of having processes of 100 ms, and a
very low probability of having processes of 300 ms and 500
ms, then our distribution function f_St(t) could look like this:
f_St(50) = 0.20, f_St(100) = 0.50, f_St(200) = 0.20,
f_St(300) = 0.05, and f_St(500) = 0.05.

■ 

Another piece of information concerning the future applications
is related to the period of the constituent process graphs. In
particular, the smallest expected period T_min is assumed to be
given, together with the expected necessary processor time
t_need and bus bandwidth b_need inside such a period T_min. As
will be shown later, this information is treated in a flexible
way during the design process and is used in order to provide a
fair distribution of available resources.

The execution times in S_t, as well as t_need, are considered
relative to the slowest node in the system. All the other nodes
are characterized by a speedup factor relative to this slowest
node. A normalization with these factors is performed when
computing the metrics C_1^P and C_2^P introduced in the following
section.



5.3 Quality Metrics and Objective Function 

A designer will be able to map and schedule an application
Γ_future on top of a system implementing ψ and Γ_current if there
are sufficient resources available. For the discussion in this chapter,
the resources which we consider are processor time and the 
bandwidth on the bus. In the context of a non-preemptive static 
scheduling policy, having free resources translates into having 
free time slots on the processors and having space left for 
messages in the bus slots. We call these free slots of available 
time on the processor or on the bus, slack. 

It should be noted that the total quantity of computation and
communication power available on our system after we have
mapped and scheduled Γ_current on top of ψ is the same regardless
of the mapping and scheduling policies used. What depends on
the mapping and scheduling strategy is the distribution of 
slacks along the time line and the size of the individual slacks. It 
is exactly this size and distribution of the slacks that character- 
izes the quality of a certain design alternative from the point of 
view of flexibility for future upgrades. 

In this section we introduce two criteria in order to reflect the
degree to which a design alternative meets the requirement b
presented in Section 5.2. For each criterion we provide metrics
which quantify the degree to which the criterion is met. The first
criterion reflects how well the resulting slack sizes fit a future
application, and the second criterion expresses how well the
slack is distributed in time.

5.3.1 Slack Sizes (the first criterion) 

The slack sizes resulting after the implementation of Γ_current
on top of ψ should be such that they best accommodate a given
family of applications Γ_future, characterized by the sets S_t
and S_b, and the probability distributions f_St and f_Sb, as
outlined in Section 5.2.1.

Example 5.4: Let us go back to the example in Figure 5.2,
where Γ1 is what we now call Γ_current, while Γ2, to be later
implemented on top of ψ and Γ1, is Γ_future. This Γ_future
consists of the two processes P6 and P7. It can be observed that
the best configuration from the point of view of accommodating
Γ_future, taking into consideration only slack sizes, is to have
a contiguous slack after the implementation of Γ_current
(Figure 5.2d1). However, in reality, it is almost impossible to
map and schedule the current application such that a contiguous
slack is obtained. Not only is it practically impossible, but it
is also undesirable from the point of view of the second design
criterion, to be discussed next. However, as we can see from
Figure 5.2b1, if we schedule Γ_current such that it fragments
the slack too much, it is impossible to fit Γ_future because
there is no slack that can accommodate process P7. A situation
like the one depicted in Figure 5.2c1 is desirable, where the
resulting slack sizes are adapted to the characteristics of the
Γ_future application.

■ 

In order to measure the degree to which the slack sizes in a
given design alternative fit the future applications, we provide
two metrics, C_1^P and C_1^m. C_1^P captures how much of the
largest future application, which theoretically could be mapped
on the system, can be mapped on top of the current design
alternative. C_1^m is similar, relative to the slacks in the bus
slots.
What does the largest future application which theoretically
could be mapped on the system look like? The total processor
time and bus bandwidth available for this largest future
application is the total slack available on the processors and
bus, respectively, after implementing Γ_current. Process and
message sizes of this hypothetical largest application are
estimated knowing the total size of the available slack, and the
characteristics of the future applications as expressed by the
sets S_t and S_b, and the probability distributions f_St and
f_Sb.

Example 5.5: Let us consider, for example, that the total
slack size on the processors is 2800 ms and the set of possible
worst-case execution times is S_t = {50, 100, 200, 300, 500
ms}. The probability distribution function is defined as
follows: f_St(50) = 0.20, f_St(100) = 0.50, f_St(200) = 0.20,
f_St(300) = 0.05, and f_St(500) = 0.05. Under these
circumstances, the largest hypothetical future application will
consist of 20 processes: 10 processes (half of the total,
f_St(100) = 0.50) with a worst-case execution time of 100 ms,
four processes with 50 ms, four with 200 ms, one with 300 ms
and one with 500 ms.

■ 
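
The numbers in Example 5.5 can be reproduced with a few lines of code. The sketch below assumes that the number of processes is obtained by dividing the total slack by the expected worst-case execution time under f_St (here 140 ms), which matches the figures of the example; the function name is hypothetical, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction
from typing import Dict


def largest_future_application(total_slack: int,
                               distribution: Dict[int, Fraction]) -> Dict[int, int]:
    """Number of processes per WCET in the largest hypothetical application."""
    assert sum(distribution.values()) == 1
    # Expected WCET of one future process under the given distribution.
    expected_wcet = sum(t * p for t, p in distribution.items())
    # How many "average" processes fit into the total slack.
    count = total_slack // expected_wcet
    # Split the processes over the WCET values according to f_St.
    return {t: round(count * p) for t, p in distribution.items()}


F_ST = {
    50: Fraction(20, 100),
    100: Fraction(50, 100),
    200: Fraction(20, 100),
    300: Fraction(5, 100),
    500: Fraction(5, 100),
}
processes = largest_future_application(2800, F_ST)
print(processes)
```

For other inputs the rounded counts need not add up exactly to the total, so this is only a starting point for generating the hypothetical application.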

After we have determined the number of processes of this
largest hypothetical Γ_future and their worst-case execution
times, we apply a bin-packing algorithm [Mar90] using the
best-fit policy, in which we consider processes as the objects to
be packed, and the available slacks as containers. The total
execution time of the processes which are left unpacked, relative
to the total execution time of the whole process set, gives the
metric C_1^P. The metric C_1^m is obtained in the same way, but
applied to message sizes and available slacks in the bus slots.

Example 5.6: Let us consider the example in Figure 5.2
and suppose a hypothetical Γ_future consisting of two processes
like those of application Γ2. For the design alternatives in
Figure 5.2c1 and 5.2d1, C_1^P = 0% (both alternatives are
perfect from the point of view of slack sizes). For the
alternative in Figure 5.2b1, however, C_1^P = 30 / 40 = 75%: the
worst-case execution time of P7 (which is left unpacked) relative
to the total execution time of the two processes.

■ 
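
The computation of C_1^P can be sketched as a best-fit packing of processes into slacks. The following is an illustration only: it assumes that processes are packed in decreasing order of execution time (the text specifies the best-fit policy but not the order), and the slack layouts used below are hypothetical stand-ins for the schedules of Figure 5.2.

```python
from typing import List


def slack_size_metric(sizes: List[int], slacks: List[int]) -> float:
    """Fraction of the total size that cannot be packed into the slacks."""
    free = list(slacks)
    unpacked = 0
    for size in sorted(sizes, reverse=True):
        # Best fit: the fitting container with the least free space.
        fitting = [k for k, cap in enumerate(free) if cap >= size]
        if fitting:
            k = min(fitting, key=lambda idx: free[idx])
            free[k] -= size
        else:
            unpacked += size
    return unpacked / sum(sizes)


future = [10, 30]  # P6 and P7 of Example 5.6
print(slack_size_metric(future, [10, 10, 10, 10]))  # fragmented slack
print(slack_size_metric(future, [40]))              # contiguous slack
```

With a fragmented slack in which no piece reaches 30 ms, P7 stays unpacked and the metric is 30 / 40, as in Example 5.6; with a single contiguous slack of 40 ms both processes fit and the metric is 0. The same function applies to C_1^m when message sizes and bus slacks are passed instead.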

5.3.2 Distribution of Slacks (the second criterion) 

In the previous section we have defined a metric which captures 
how well the sizes of the slacks fit a possible future application. 
A similar metric is needed to characterize the distribution of 
slacks over time. 

Let Pj be a process with period Tj that belongs to a future
application, and M(Pj) the node on which Pj will be mapped. The
worst-case execution time of Pj on node M(Pj) is Cj. In order to
schedule Pj we need a slack of size Cj that is available
periodically, within a period Tj, on processor M(Pj). If we
consider a group of processes with period T, which are part of
Γ_future, then in order to implement them a certain amount of
slack is needed which is available periodically, with a period T,
on the nodes implementing the respective processes.

During the implementation of Γ_current we aim for a slack
distribution such that the future application with the smallest
expected period T_min and with the necessary processor time
t_need and bus bandwidth b_need can be accommodated (see
Section 5.2.1).

Thus, for each node, we compute the minimum periodic slack
inside a T_min period. By summing these minima, we obtain the
slack which is available periodically to Γ_future. This is the
C_2^P metric. The C_2^m metric characterizes the minimum
periodically available bandwidth on the bus and it is computed
in a similar way.

Example 5.7: In Figure 5.3 we consider an example with
T_min = 120 ms, t_need = 90 ms, and b_need = 65 ms. The length
of the schedule table of the system implementing ψ and Γ_current
is 360 ms (in Section 5.4 we will elaborate on the length of
the global schedule table). Consequently, we have to investigate
three periods of length T_min each. The system consists of
three nodes.

Let us consider the situation in Figure 5.3a. In the first
period, Period 1, there are 40 ms of slack available on node
N1, in the second period 80 ms, and in the third period no
slack is available on N1. Hence, the total slack a future
application of period T_min can use on node N1 is
min(40, 80, 0) = 0 ms. Neither can node N2 provide slack for this
application, as in Period 1 there is no slack available. However,
on node N3 there are at least 40 ms of slack available in each
period. Thus, with the configuration in Figure 5.3a we have
C_2^P = 40 ms, which is not sufficient to accommodate
t_need = 90 ms. The available periodic slack on the bus is also
insufficient: C_2^m = 60 ms < b_need = 65 ms.

However, in the situation presented in Figure 5.3b, we
have C_2^P = 120 ms > t_need = 90 ms and C_2^m > b_need = 65 ms,
which means that enough resources are available, periodically,
for the application.

■ 
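
The C_2^P computation of Example 5.7 reduces to a sum of per-node minima. In the sketch below only the slack values for N1 and the fact that N2 has no slack in Period 1 and N3 has at least 40 ms in every period come from the example; the remaining per-period values and the function name are hypothetical.

```python
from typing import Dict, List


def periodic_slack_metric(slack_per_node: Dict[str, List[int]]) -> int:
    """Sum over all nodes of the minimum slack found in any T_min period."""
    return sum(min(periods) for periods in slack_per_node.values())


# Slack (ms) per node in Period 1, 2 and 3 for the case of Figure 5.3a.
slack_a = {
    "N1": [40, 80, 0],   # from Example 5.7
    "N2": [0, 30, 50],   # only the 0 in Period 1 is from the example
    "N3": [40, 60, 50],  # only the minimum of 40 is from the example
}
c2p = periodic_slack_metric(slack_a)
print(c2p, c2p >= 90)  # compare with t_need = 90 ms
```

Because the minimum is taken per node before summing, a single period without slack on a node removes that node's entire contribution, regardless of how much slack it offers in the other periods.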

5.3.3 Objective Function and Exact Problem Formulation 

In order to capture how well a certain design alternative meets
the requirement b stated in Section 5.2, the metrics discussed
before are combined in an objective function, as follows:

$$C = w_1^P (C_1^P)^2 + w_1^m (C_1^m)^2 + w_2^P \max(0,\; t_{need} - C_2^P) + w_2^m \max(0,\; b_{need} - C_2^m) \tag{5.1}$$

where the metric values introduced in the previous section are
weighted by the constants w_1^P, w_1^m, w_2^P, and w_2^m. Our
mapping and scheduling strategy will try to minimize this
function.

The first two terms measure how well the resulting slack sizes
fit a future application (the first criterion), while the last
two terms reflect the distribution of slacks (the second
criterion). In order to obtain a balanced solution, one that
favors a good fit both on the processors and on the bus, we have
used the squares of the metrics.
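
A direct transcription of Equation 5.1 is given below as a sketch. The function and parameter names are hypothetical, the default weights of 1.0 are placeholders (the text does not give weight values), and the second-criterion test corresponds to requiring that C_2^P and C_2^m reach t_need and b_need.

```python
def objective(c1p: float, c1m: float, c2p: float, c2m: float,
              t_need: float, b_need: float,
              w1p: float = 1.0, w1m: float = 1.0,
              w2p: float = 1.0, w2m: float = 1.0) -> float:
    """Objective function C of Equation 5.1 (to be minimized)."""
    return (w1p * c1p ** 2                    # slack sizes, processors
            + w1m * c1m ** 2                  # slack sizes, bus
            + w2p * max(0, t_need - c2p)      # missing periodic CPU slack
            + w2m * max(0, b_need - c2m))     # missing periodic bus slack


def meets_second_criterion(c2p: float, c2m: float,
                           t_need: float, b_need: float) -> bool:
    """True when enough periodic slack exists on processors and bus."""
    return c2p >= t_need and c2m >= b_need


# Values of Example 5.7, case a: C_2^P = 40 ms, C_2^m = 60 ms.
print(objective(0, 0, 40, 60, t_need=90, b_need=65))
print(meets_second_criterion(40, 60, 90, 65))
```

Once the second criterion is met, both max terms vanish and only the squared slack-size terms remain to be minimized.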

We call a valid solution one with a mapping and scheduling
which satisfies all the design constraints (in our case the
deadlines) and meets the second criterion (C_2^P ≥ t_need and
C_2^m ≥ b_need)¹.

At this point we can give an exact formulation of our problem:
Given an existing set of applications $\psi$ which are already
mapped and scheduled, and an application $\Gamma_{current}$ to be
implemented on top of $\psi$, we are interested to find that
subset $\Omega \subseteq \psi$ of old applications to be remapped
and rescheduled such that we produce a valid solution for
$\Gamma_{current} \cup \Omega$ and the total cost of modification
$R(\Omega)$ is minimized (see Section 2.3.2 for the details
concerning the modification cost of an application). Once such a
set $\Omega$ of applications is found, we are interested to
optimize the implementation such that the objective function C
(Equation 5.1) is minimized, considering a family of future
applications characterized by the sets $S_t$ and $S_b$, the
functions $f_{S_t}$ and $f_{S_b}$, as well as the parameters
$T_{min}$, $t_{need}$, and $b_{need}$.

A mapping and scheduling strategy based on this problem for- 
mulation is presented in the following section. 



5.4 Mapping and Scheduling Strategy

As shown in the algorithm in Figure 5.4, our mapping and
scheduling strategy (MS) consists of two steps. In the first step
(lines 1-14) we try to obtain a valid solution for the mapping and
scheduling of $\Gamma_{current} \cup \Omega$ so that the
modification cost $R(\Omega)$ is minimized. Starting from such a
solution, the second step (lines 17-20) iteratively improves the
design in order to minimize the objective function C. In the
context in which the second criterion is satisfied after the first
step, improving the cost function during the second step aims at
minimizing the value of

$w_1^P (C_1^P)^2 + w_1^m (C_1^m)^2$.

1. This definition of a valid solution can be relaxed by imposing
only the satisfaction of deadlines. In this case, the mapping and
scheduling algorithm in Figure 5.4 will look for a solution which
satisfies the deadlines and minimizes $R(\Omega)$; the additional
second criterion is, in this case, only considered optionally.

If the first step has not succeeded in finding a solution such
that the imposed timing constraints are satisfied, this means
that there are not sufficient resources available to implement
the application $\Gamma_{current}$. Thus, modifications of the
system architecture have to be performed before restarting the
mapping and scheduling procedure. If, however, the timing
constraints are met but the second design criterion is not
satisfied, a larger $T_{min}$ (smallest expected period of a
future application, see Section 5.2.1) or smaller values for
$t_{need}$ and/or $b_{need}$ are suggested to the designer
(line 11). This, of course, reduces the frequency of possible
future applications and the amount of processor and bus resources
available to them.

MappingSchedulingStrategy
 1  Step 1: try to find a valid solution that minimizes $R(\Omega)$
 2    Find a mapping and scheduling of $\Gamma_{current} \cup \Omega$ on top of $\psi \setminus \Omega$ so that:
 3      1. constraints are satisfied;
 4      2. modification cost $R(\Omega)$ is minimized;
 5      3. the second criterion is satisfied: $C_2^P \geq t_{need}$ and $C_2^m \geq b_{need}$
 6
 7    if Step 1 has not succeeded then
 8      if constraints are not satisfied then
 9        change architecture
10      else
11        suggest new $T_{min}$, $t_{need}$ or $b_{need}$
12      end if
13      go to Step 1
14    end if
15
16
17  Step 2: improve the solution by minimizing objective function C
18    Perform iteratively transformations which
19      improve the first criterion (the metrics $C_1^P$ and $C_1^m$)
20      without invalidating the second criterion.
21
end MappingSchedulingStrategy

Figure 5.4: The Mapping and Scheduling Strategy

In the following section we briefly discuss the basic mapping 
and scheduling algorithm we have used in order to generate an 
initial solution. The heuristic used to iteratively improve the 
design with regard to the first and the second design criteria is 
presented in Section 5.4.2. In Section 5.4.3 we describe three 
alternative heuristics which can be used during the first step in 
order to find the optimal subset of applications to be modified. 

5.4.1 The Initial Mapping and Scheduling 

The first step of our mapping and scheduling strategy MS
consists of an iteration that tries different subsets
$\Omega \subseteq \psi$ with the intention to find that subset
$\Omega = \Omega_{min}$ of old applications to be remapped and
rescheduled which produces a valid solution for
$\Gamma_{current} \cup \Omega$ such that $R(\Omega)$ is minimized.
Given a subset $\Omega$, the InitialMappingScheduling function
(IMS) constructs a mapping and a schedule for the applications
$\Gamma_{current} \cup \Omega$ on top of $\psi \setminus \Omega$
which meets the deadlines, without worrying about the two criteria
introduced in Section 5.3.

The IMS is a classical mapping and scheduling algorithm for
which we have used as a starting point the Heterogeneous
Critical Path (HCP) algorithm, introduced in [Jor97]. The HCP is
based on a list scheduling approach [Cof72]. We have modified
the HCP algorithm in four main regards:

1. The list scheduling approach that is used as a basis for HCP 
is considering applications modeled as conditional process 
graphs, as described in Section 4.2.1. 

2. We consider that mapping and scheduling does not start
with an empty system but a system on which a certain number
of processes are already mapped.

3. Messages are scheduled into bus-slots according to the TDMA
protocol. The TDMA-based message scheduling technique has
been presented in Section 4.3.

4. As a priority function for list scheduling we use, instead of 
the CP (critical path) priority function employed in [Jor97], 
the MPCP (modified partial critical path) function introduced 
in Section 4.3.2. The MPCP takes into consideration the par- 
ticularities of the communication protocol for calculation of 
communication delays. These delays are not estimated based 
only on the message length, but also on the time when slots, 
assigned to the particular node which generates the mes- 
sage, will be available. 

For the example in Figure 5.1, our initial mapping and sched- 
uling algorithm will be able to produce the optimal solution with 
a schedule length of 48 ms. 

However, before performing the effective mapping and
scheduling with IMS, two aspects have to be addressed. First, the
process graphs $G_i \in \Gamma_{current} \cup \Omega$ have to be
merged into a single graph $G_{current}$, by unrolling of process
graphs and inserting dummy nodes as shown in Figure 5.5. The
period $T_{G_{current}}$ of $G_{current}$ is equal to the least
common multiple of the periods $T_{G_i}$ of the graphs $G_i$.
Dummy nodes (depicted as empty disks in Figure 5.5) represent
processes with a certain execution time but are not mapped to any
processor or bus.
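
The period of the merged graph and the number of times each original graph is unrolled follow directly from the least common multiple of the periods. The sketch below illustrates this; the periods are hypothetical and the helper names `merged_period` and `unroll_counts` are ours.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def merged_period(periods):
    """Period of the merged graph: LCM of the individual graph periods."""
    return reduce(lcm, periods)


def unroll_counts(periods):
    """How many instances of each graph fit into one merged period."""
    total = merged_period(periods)
    return [total // p for p in periods]


# Hypothetical periods in ms for two process graphs.
periods = [40, 120]
print(merged_period(periods), unroll_counts(periods))
```

With these periods the first graph is unrolled three times and the second once inside one merged period.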

In addition, we have to consider during scheduling the
mismatch between the periods of the already existing system and
those of the current application. The schedule table into which
we would like to schedule $G_{current}$ has a length of
$T_{\psi \setminus \Omega}$, which is the global period of the
system $\psi$ after extraction of the applications in $\Omega$.
However, the period $T_{current}$ of $G_{current}$ can be
different from $T_{\psi \setminus \Omega}$. Hence, before
scheduling $G_{current}$ into the existing schedule table, the
schedule table is expanded to the least common multiple of the two
periods. A similar procedure is followed in the case
$T_{current} > T_{\psi \setminus \Omega}$.

[Figure 5.5 shows two process graphs from
$\Gamma_{current} \cup \Omega$, each with its own period and
deadline, and the merged process graph $G_{current}$ obtained by
unrolling them and inserting a source node and dummy processes.
The execution times of the dummy processes are expressed in terms
of the periods and deadlines of the original graphs.]

Figure 5.5: Process Graph Merging Example

5.4.2 Iterative Design Transformations

Once IMS has produced a mapping and scheduling which satisfies
the timing constraints, the next goal of Step 1 is to improve
the design in order to satisfy the second design criterion
($C_2^P \geq t_{need}$ and $C_2^m \geq b_{need}$). During the
second step, the design is then further transformed with the goal
of minimizing the value of $w_1^P (C_1^P)^2 + w_1^m (C_1^m)^2$,
according to the requirements of the first criterion, without
invalidating the second criterion achieved in the first step. In
both steps we iteratively improve the design using a
transformational approach. These successive transformations are
performed inside the (innermost) repeat loops of the first
(lines 11-19 in Figure 5.6) and second step (lines 31-38). A
new design is obtained from the current one by performing a
transformation called move. We consider the following two
categories of moves:

1. moving a process to a different slack found on the same node 
or on a different node; 

2. moving a message to a different slack on the bus. 

In order to eliminate those moves that will lead to an
infeasible design (that violates deadlines), we do as follows. For
each process $P_i$, we calculate the ASAP($P_i$) and ALAP($P_i$)
times considering the resources of the given hardware
architecture. ASAP($P_i$) is the earliest time $P_i$ can start its
execution, while ALAP($P_i$) is the latest time $P_i$ can start
its execution without causing the application to miss its
deadline. When moving $P_i$ we will consider slacks on the target
processor only inside the [ASAP($P_i$), ALAP($P_i$)] interval. The
same reasoning holds for messages, with the addition that a
message can only be moved to slacks belonging to a slot that
corresponds to the sender node. Any violation of the data
dependency constraints caused by a move is rectified by shifting
the processes or messages concerned in an appropriate way. If such
a shift produces a deadline violation, the move is rejected.

 1  Step 1: try to find a valid solution that minimizes $R(\Omega)$
 2    $\Omega = \emptyset$
 3    repeat
 4      succeeded = InitialMappingScheduling($\psi \setminus \Omega$, $\Gamma_{current} \cup \Omega$)
 5      -- compute ASAP-ALAP intervals for all processes
 6      ASAP($\Gamma_{current} \cup \Omega$); ALAP($\Gamma_{current} \cup \Omega$)
 7      -- if time constraints are satisfied
 8      if succeeded then
 9        -- design transformations in order to satisfy
10        -- the second design criterion
11        repeat
12          -- find set of moves with the highest potential
13          -- to maximize $C_2^P$ or $C_2^m$
14          move_set = PotentialMove$C_2^P$($\Gamma_{current} \cup \Omega$) $\cup$
15                     PotentialMove$C_2^m$($\Gamma_{current} \cup \Omega$)
16          -- select and perform move which improves most $C_2$
17          move = SelectMove$C_2$(move_set); Perform(move)
18          succeeded = $C_2^P \geq t_{need}$ and $C_2^m \geq b_{need}$
19        until succeeded or maximum number of iterations reached
20      end if
21      if succeeded and $R(\Omega)$ smallest so far then
22        $\Omega_{valid} = \Omega$; solution_valid = solution_current
23      end if
24      -- try another subset
25      $\Omega$ = NextSubset($\Omega$)
26    until termination condition
27
28  Step 2: improve the solution by minimizing objective function C
29    solution_current = solution_valid; $\Omega_{min} = \Omega_{valid}$
30    -- design transformations in order to satisfy the first design criterion
31    repeat
32      -- find set of moves with highest potential to minimize $C_1^P$ or $C_1^m$
33      move_set = PotentialMove$C_1^P$($\Gamma_{current} \cup \Omega_{min}$) $\cup$
34                 PotentialMove$C_1^m$($\Gamma_{current} \cup \Omega_{min}$)
35      -- select move which improves $w_1^P (C_1^P)^2 + w_1^m (C_1^m)^2$ and
36      -- does not invalidate the second criterion
37      move = SelectMove$C_1$(move_set); Perform(move)
38    until $w_1^P (C_1^P)^2 + w_1^m (C_1^m)^2$ has not changed or
39      maximum number of iterations reached

Figure 5.6: Step One and Two of the Mapping and
Scheduling Strategy in Figure 5.4

At each step, our heuristic tries to find those moves that have
the highest potential to improve the design. For each iteration a
set of potential moves is selected by the PotentialMoveX functions.
SelectMoveX then evaluates these moves with regard to the
respective metrics and selects the best one to be performed. We
now briefly discuss the four PotentialMoveX functions with the
corresponding moves.

PotentialMove$C_2^P$ and PotentialMove$C_2^m$

Example 5.8: Consider Figure 5.3a on page 105. In Period 3
on node $N_1$ there is no available slack. However, if we move
process $P_1$ with 40 ms to the left into Period 2, as depicted in
Figure 5.3b, we create a slack in Period 3 and the periodic slack
on node $N_1$ will be min(40, 40, 40) = 40 ms, instead of 0 ms.

■ 

Potential moves aimed at improving the metric $C_2^P$ will be the
shifting of processes inside their [ASAP, ALAP] interval in order to
improve the periodic slack. The move can be performed on the
same node or to the less loaded nodes. The same is true for
moving messages in order to improve the metric $C_2^m$. For the
improvement of the periodic bandwidth on the bus, we also
consider movement of processes, trying to place the sender and
receiver of a message on the same processor and, thus, reducing
the bus load.
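
The restriction of moves to the [ASAP, ALAP] interval can be sketched as a simple filter over the slacks of a target processor. This is a minimal illustration under our own assumptions: each slack is a (begin, end) pair, only the earliest feasible start inside each slack is reported, and the function name `feasible_starts` and all numbers are hypothetical.

```python
def feasible_starts(slacks, wcet, asap, alap):
    """Earliest feasible start time inside each slack, honoring [ASAP, ALAP].

    slacks: list of (begin, end) idle intervals on the target processor.
    wcet:   worst case execution time of the process being moved.
    asap:   earliest time the process may start.
    alap:   latest time the process may start without missing the deadline.
    """
    starts = []
    for begin, end in slacks:
        start = max(begin, asap)  # cannot start before ASAP
        # The start must not exceed ALAP and the process must fit in the slack.
        if start <= alap and start + wcet <= end:
            starts.append(start)
    return starts


# Hypothetical slacks in ms and a process of 20 ms.
slacks = [(0, 10), (30, 60), (90, 140)]
print(feasible_starts(slacks, wcet=20, asap=25, alap=100))
print(feasible_starts(slacks, wcet=20, asap=25, alap=80))
```

A full implementation would additionally shift dependent processes and messages and reject the move if that shift causes a deadline violation, as described above.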

PotentialMove$C_1^P$ and PotentialMove$C_1^m$

The moves suggested by these two functions aim at improving
the $C_1$ metric through reducing the slack fragmentation. The
heuristic is to evaluate only those moves that iteratively
eliminate the smallest slack in the schedule.

Example 5.9: Let us consider the example in Figure 5.7,
where we have three applications mapped on a single processor:
$\psi$, consisting of $P_1$ and $P_2$; $\Gamma_{current}$, having
processes $P_3$, $P_4$ and $P_5$; and $\Gamma_{future}$, with
$P_6$, $P_7$ and $P_8$. Figure 5.7 presents three possible
schedules; processes are depicted with rectangles, the width of a
rectangle representing the worst case execution time of that
process. The PotentialMove$C_1$ functions start by identifying the
smallest slack in the schedule table.

In Figure 5.7a, the smallest slack is the slack between $P_1$
and $P_3$. Once the smallest slack has been identified, potential
moves are investigated which either remove or enlarge the
slack. For example, the slack between $P_1$ and $P_3$ can be
removed by attaching $P_3$ to $P_1$, and it can be enlarged by
moving $P_3$ to the right in the schedule table. Moves that
remove the slack are considered only if they do not lead to an
invalidation of the second design criterion, measured by the
$C_2$ metric improved in the previous step (see Figure 5.6, Step
1). Also, the slack can be enlarged only if it does not create,
as a result, other unusable slack. A slack is unusable if it
cannot hold the smallest object of the future application, in
our case $P_6$.

[Figure 5.7 shows three schedules on a single processor for
$\psi$, $\Gamma_{current}$ and $\Gamma_{future}$:

a) $P_8$ cannot be mapped; move $P_3$ to start from 20.
Smallest slack: between $P_1$ and $P_3$.
Potential moves: $P_3$ starting at 20, having $C_1^P$ = 50%
(denoted with 20/50%), 30/50%, 40/50%, 50/50%.
Selected move: $P_3$ to 20, with $C_1^P$ = 50%.

b) $P_8$ cannot be mapped; move $P_5$ to start from 90.
Smallest slack: between $P_5$ and $P_2$.
Potential moves: $P_5$ to 90/0%, 100/0%, 110/50%, 130/50%,
140/50%, 150/0%, 160/0%.
Selected move: $P_5$ to 90, with $C_1^P$ = 0%.

c) Successful implementation.]

Figure 5.7: Successive Steps with Potential
Moves for Improving the First Design Metric

In Figure 5.7a, the slack can be removed by moving $P_3$
such that it starts from time 20, immediately after $P_1$, and it
can be enlarged by moving $P_3$ so that it starts from 30, 40, or
50 (considering an increment which here was set by us to 10,
the size of $P_6$, the smallest object in $\Gamma_{future}$). For
each move, the improvement on the $C_1^P$ metric is calculated,
and the move which leads to the largest improvement on $C_1^P$ is
selected by the SelectMove$C_1^P$ function to be performed. For
all the previously considered moves of $P_3$, we are not able to
map $P_8$, which represents 50% of $\Gamma_{future}$; therefore
$C_1^P$ = 50%. Consequently, we can perform any of the mentioned
moves, and our algorithm selects the first one investigated, the
move to start $P_3$ from 20, thus removing the slack. As a result
of this move, the new schedule table is the one in Figure 5.7b.

In the next call to the PotentialMove$C_1^P$ function, the slack
between $P_5$ and $P_2$ is identified as the smallest slack. Out of
the potential moves that eliminate this slack, listed in
Figure 5.7 for case b, several lead to $C_1^P$ = 0%, the largest
improvement (no processes from $\Gamma_{future}$ are left out, so
$C_1^P$ = 0%). SelectMove$C_1^P$ selects moving $P_5$ to start
from 90, and thus we are able to map process $P_8$ of the future
application, leading to a successful implementation in
Figure 5.7c.

■ 

The previous example has only illustrated movements of
processes. Similarly, in PotentialMove$C_1^m$, we also consider
moves of messages in order to improve $C_1^m$. However, the
movement of messages is restricted by the TDMA bus access scheme,
such that a message can only be moved in the same slot of another
round.

5.4.3 Minimizing the Total Modification Cost

The first step of our mapping and scheduling strategy, described
in Figure 5.6, iterates on successive subsets $\Omega$ searching
for a valid solution which also minimizes the total modification
cost $R(\Omega)$, calculated using Equation 2.1 in Section 2.3.2.
As a first attempt, the algorithm searches for a valid
implementation of $\Gamma_{current}$ without disturbing the
existing applications ($\Omega = \emptyset$). If no valid solution
is found, successive subsets $\Omega$ produced by the function
NextSubset are considered, until a termination condition is met.
The performance of the algorithm, in terms of runtime and quality
of the solutions produced, is strongly influenced by the strategy
employed for the function NextSubset and the termination
condition. They determine how the design space is explored while
testing different subsets $\Omega$ of applications.

In the following sections we present three alternative strate- 
gies for the implementation of the NextSubset function. The first 
two can be considered as situated at opposite extremes: The first 
one is potentially very slow but produces the optimal result 
while the second is very fast and possibly low quality. The third 
alternative is a heuristic capable of producing good quality 
results in relatively short time, as will be demonstrated by the 
experimental results presented in Section 5.5.2. 

Exhaustive Search (ES) 

In order to find $\Omega_{min}$, the simplest solution is to try
successively all the possible subsets $\Omega \subseteq \psi$.
These subsets are generated in ascending order of the total
modification cost, starting from $\emptyset$. The termination
condition is fulfilled when the first valid solution is found or
no new subsets are to be generated. Since the subsets are
generated in ascending order, according to their cost, the subset
$\Omega$ that first produces a valid solution is also the subset
with the minimum modification cost.






The generation of subsets is performed according to the graph
that characterizes the existing applications (see Section 2.3.2).
Finding the next subset, starting from the current one, is
achieved by a branch and bound algorithm whose running time, in
the worst case, grows exponentially with the number of
applications.

Example 5.10: For the example in Figure 2.6 on page 34,
discussed in Section 2.3.2, the call to NextSubset($\emptyset$)
will generate $\Omega = \{\Gamma_7\}$ which has the smallest
nonzero modification cost $R(\{\Gamma_7\})$ = 20. The next
generated subsets, in order, together with their corresponding
total modification costs are: $R(\{\Gamma_3\})$ = 50,
$R(\{\Gamma_3, \Gamma_7\})$ = 70, $R(\{\Gamma_4, \Gamma_7\})$ = 90
(the inclusion of $\Gamma_4$ triggers the inclusion of
$\Gamma_7$), $R(\{\Gamma_2, \Gamma_3\})$ = 120,
$R(\{\Gamma_2, \Gamma_3, \Gamma_7\})$ = 140,
$R(\{\Gamma_3, \Gamma_4, \Gamma_7\})$ = 140,
$R(\{\Gamma_1\})$ = 150, and so on. The total number of possible
subsets according to the graph in Figure 2.6 is 16.

■ 
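
The ordering of subsets in this example can be reproduced with a brute-force sketch. It is not the branch and bound procedure mentioned above: it simply enumerates every dependency-closed subset and sorts by cost. The cost table and the two dependencies are inferred from Examples 5.10 and 5.11 and cover only the applications named there, so this reduced table does not reproduce the 16 subsets of the full graph. The identifiers `G1` to `G7`, `COST` and `DEPS` are ours.

```python
from itertools import combinations

# Modification costs inferred from the examples (assumption).
COST = {"G1": 150, "G2": 70, "G3": 50, "G4": 70, "G7": 20}
# Assumed dependencies: modifying the key requires modifying the values.
DEPS = {"G2": {"G3"}, "G4": {"G7"}}


def is_closed(subset):
    """True if every application in the subset has its dependencies included."""
    return all(DEPS.get(app, set()) <= subset for app in subset)


def subsets_by_cost():
    """All dependency-closed subsets as (cost, members), in ascending cost."""
    apps = sorted(COST)
    closed = []
    for size in range(len(apps) + 1):
        for combo in combinations(apps, size):
            members = frozenset(combo)
            if is_closed(members):
                closed.append((sum(COST[a] for a in members), sorted(members)))
    return sorted(closed)


def exhaustive_search(is_valid):
    """Return the cheapest subset accepted by is_valid, or None."""
    for cost, members in subsets_by_cost():
        if is_valid(set(members)):
            return cost, members
    return None


for cost, members in subsets_by_cost()[:9]:
    print(cost, members)
```

Here `is_valid` stands in for a complete run of the first step of MS on the given subset, which is where the real cost of the exhaustive approach lies.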

This approach, while finding the optimal subset $\Omega$,
requires a large amount of computation time and can be used only
with a small number of applications.

Ad-Hoc Selection Heuristic (AS) 

If the number of applications is larger, a possible solution could
be based on a simple greedy heuristic which, starting from
$\Omega = \emptyset$, progressively enlarges the subset until a
valid solution is produced. The algorithm looks at all the
non-frozen applications and picks the one which, together with its
dependencies, has the smallest modification cost. If the new
subset does not produce a valid solution, it is enlarged by
including, in the same fashion, the next application with its
dependencies. This greedy expansion of the subset is continued
until the set is large enough to lead to a valid solution or no
application is left.

Example 5.11: For the example in Figure 2.6 the call to
NextSubset($\emptyset$) will produce $R(\{\Gamma_7\})$ = 20, and
the subset will be successively enlarged to
$R(\{\Gamma_7, \Gamma_3\})$ = 70,
$R(\{\Gamma_7, \Gamma_3, \Gamma_2\})$ = 140 ($\Gamma_4$ could have
been picked as well in this step because it has the same
modification cost of 70 as $\Gamma_2$ and its dependency
$\Gamma_7$ is already in the subset),
$R(\{\Gamma_7, \Gamma_3, \Gamma_2, \Gamma_4\})$ = 210, and so on.

■ 
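
A minimal sketch of this greedy expansion is given below. The cost table and dependencies are again inferred from Examples 5.10 and 5.11 (an assumption, limited to the applications named there), ties are broken alphabetically as an arbitrary choice of ours, and the names `closure`, `added_cost` and `greedy_subsets` are hypothetical.

```python
# Modification costs and dependencies inferred from the examples (assumption).
COST = {"G1": 150, "G2": 70, "G3": 50, "G4": 70, "G7": 20}
DEPS = {"G2": {"G3"}, "G4": {"G7"}}


def closure(app, chosen):
    """The application plus all of its dependencies not yet in the subset."""
    needed, stack = set(), [app]
    while stack:
        current = stack.pop()
        if current not in chosen and current not in needed:
            needed.add(current)
            stack.extend(DEPS.get(current, ()))
    return needed


def added_cost(app, chosen):
    """Extra modification cost of adding app with its missing dependencies."""
    return sum(COST[a] for a in closure(app, chosen))


def greedy_subsets():
    """Yield the successively enlarged subsets and their total costs."""
    chosen = set()
    while chosen != set(COST):
        candidates = sorted(set(COST) - chosen)  # alphabetical tie-break
        best = min(candidates, key=lambda a: added_cost(a, chosen))
        chosen |= closure(best, chosen)
        yield sorted(chosen), sum(COST[a] for a in chosen)


for members, cost in greedy_subsets():
    print(cost, members)
```

In the actual heuristic the expansion stops as soon as a subset leads to a valid solution; the generator above simply lists every subset the greedy order would visit.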

While this approach very quickly finds a valid solution, if one
exists, it is possible that the resulting total modification cost
is much higher than the optimal one.

Subset Selection Heuristic (SH) 

An intelligent selection heuristic should be able to identify the 
reasons due to which a valid solution has not been produced in 
the first step of the MS algorithm in Figure 5.6, and to find the 
set of candidate applications which, if modified, could eliminate 
the problem. 

The failure to produce a valid solution can have two possible 
causes: an initial mapping which meets the deadlines has not 
been found, or the second criterion is not satisfied. 

Let us investigate the first reason. If an application $\Gamma_i$
is to meet its deadline $D_i$, all its processes
$P_j \in \Gamma_i$ have to be scheduled inside their [ASAP, ALAP]
intervals. InitialMappingScheduling (IMS) fails to schedule a
process inside its [ASAP, ALAP] interval if there is not enough
slack available on any processor, due to other processes scheduled
in the same interval. In this situation we say that there is a
conflict with processes belonging to other applications. We are
interested to find out which applications are responsible for
conflicts encountered during the mapping and scheduling of
$\Gamma_{current}$, and not only that, but also which ones are
flexible enough to be moved away in order to avoid these
conflicts.

If it is not able to find a solution that satisfies the deadlines,
IMS will determine a metric $\Delta_{\Gamma_i}$ that characterizes
both the degree of conflict and the flexibility of each
application $\Gamma_i \in \psi$ in relation to $\Gamma_{current}$.
A set of applications $\Omega$ will be characterized, in relation
to $\Gamma_{current}$, by the following metric:

$$\Delta(\Omega) = \sum_{\Gamma_i \in \Omega} \Delta_{\Gamma_i} \qquad (5.2)$$

This metric $\Delta(\Omega)$ will be used by our subset selection
heuristic in the case IMS has failed to produce a solution which
satisfies the deadlines. An application with a larger
$\Delta_{\Gamma_i}$ is more likely to lead to a valid schedule if
included in $\Omega$.

Example 5.12: In Figure 5.8 we illustrate how this metric
is calculated. Applications A, B and C are implemented on a
system consisting of the three processors $N_1$, $N_2$ and $N_3$.
The current application to be implemented is D. At a certain
moment, IMS comes to the point to map and schedule process
$D_1 \in D$. However, it is not able to place it inside its
[ASAP, ALAP] interval, denoted in Figure 5.8 as $I$. The reason is
that there is not enough slack available inside $I$ on any of the
processors, because processes $A_1$, $A_2$, $A_3 \in A$,
$B_1 \in B$, and $C_1 \in C$ are scheduled inside that interval.
We are interested to determine which of the applications A, B,
and C are more likely to lend free slack for $D_1$, if remapped
and rescheduled.

Therefore, we calculate the slack resulting after we move
away processes belonging to these applications from the
interval $I$. For example, the resulting slack available after
modifying application C (moving $C_1$ either to the left or to the
right inside its own [ASAP, ALAP] interval) is of size
$|I| - \min(|C_1^l|, |C_1^r|)$. With $C_1^l$ ($C_1^r$) we denote
that slice of process $C_1$ which remains inside the interval $I$
after $C_1$ has been moved to the extreme left (right) inside its
own [ASAP, ALAP] interval. $|C_1^l|$ represents the length of
slice $C_1^l$. Thus, when considering process $D_1$, $\Delta_C$
will be incremented with
$\delta_C^{D_1} = \max(|I| - \min(|C_1^l|, |C_1^r|) - |D_1|, 0)$.
This value shows the maximum theoretical slack usable for $D_1$
that can be produced by modifying the application C. By relating
this slack to the length of $D_1$, the value $\delta_C^{D_1}$ also
captures the amount of flexibility provided by that modification.

The increments $\delta_B^{D_1}$ and $\delta_A^{D_1}$ to be added
to the values of $\Delta_B$ and $\Delta_A$, respectively, are also
presented in Figure 5.8. IMS then continues the evaluation of the
metrics $\Delta$ with the other processes belonging to the current
application D (with the assumption that process $D_1$ has been
scheduled at the beginning of interval $I$). Thus, as a result of
the failed attempt to map and schedule application D, the metrics
$\Delta_A$, $\Delta_B$, and $\Delta_C$ will be produced.
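
The increment for a single conflicting process, and the aggregation of the per-application metrics over a subset, can be sketched as follows. The increment is computed as the interval length, minus the smaller of the two slices that remain inside the interval after moving the conflicting process fully left or fully right, minus the length of the process to be placed, clamped at zero. All lengths below are hypothetical, and the function names are ours.

```python
def slack_increment(interval_len, left_slice, right_slice, proc_len):
    """Usable slack gained by moving one conflicting process out of the way.

    interval_len: length of the [ASAP, ALAP] interval of the process to place.
    left_slice:   part of the conflicting process still inside the interval
                  after it is moved to its extreme left position.
    right_slice:  the same after it is moved to its extreme right position.
    proc_len:     length of the process that could not be scheduled.
    """
    return max(interval_len - min(left_slice, right_slice) - proc_len, 0)


def delta_of_set(delta, subset):
    """Metric of a subset: sum of the per-application metrics."""
    return sum(delta[app] for app in subset)


# Hypothetical lengths in ms for one conflicting process.
inc = slack_increment(interval_len=100, left_slice=30, right_slice=50, proc_len=40)

# Hypothetical per-application metrics after a failed IMS run.
delta = {"A": 10, "B": 0, "C": inc}
print(inc, delta_of_set(delta, {"A", "C"}))
```

The cases shown in Figure 5.8 where several processes of one application occupy the interval need a more elaborate expression; this sketch covers only the single-process case discussed in the text.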




mapped on 

= max(max( 1 1 1 - | [ - min( | j , | | ), 

| I| - min(|A^|, |A| |)-^(|A^|, |A 1 |))P |Di|, 0 ) 

Dj mapped on N3 

8^1 =max(|I| - |Ai| -min(|BL|, |B^|)- |Di|, 0 ); 

8^1 =max(|Il -minCldfl, |C^|)- |Dil, 0 ) 

Figure 5.8: Metric for the Subset Selection Heuristic 
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If the initial mapping was successful, the first step of MS could 
fail during the attempt to satisfy the second criterion 
(Figure 5.6). In this case, the metric Ap. is computed in a differ- 
ent way. What Ap . will capture in this case, is the potential of an 
application Fj to improve the metric C 2 if remapped together 
with Therefore, we consider a total number of moves 

from all the non-frozen applications in \\i These moves are deter- 
mined using the PotentialMoveC 2 functions presented in 
Section 5.4.2. Each such move will lead to a different mapping 
and schedule, and thus to a different C 2 value. Let us consider 
^move the improvement on C 2 produced by the currently con- 
sidered move. If there is no improvement, ^rnove ~ 0- Thus, for 
each move that has as subject Pj or e F^, we increment the 
metric Ap , with the improvement on € 2 - 

As shown in the algorithm in Figure 5.6, MS starts by trying
an implementation of $\Gamma_{current}$ with $\Omega = \emptyset$.
If this attempt fails, because of one of the two reasons mentioned
above, the corresponding metrics $\Delta_{\Gamma_i}$ are computed
for all $\Gamma_i \in \psi$.

Our heuristic SH will then start by finding the solution
$\Omega_{AS}$ produced with the greedy heuristic AS (this will
succeed if there exists any solution). The total modification cost
corresponding to this solution is $R_{AS} = R(\Omega_{AS})$ and
the value of the metric $\Delta$ is
$\Delta_{AS} = \Delta(\Omega_{AS})$.

SH now continues by trying to find a solution with a more
favorable $\Omega$ than $\Omega_{AS}$ (a smaller total cost R).
Therefore, the thresholds $R_{max} = R_{AS}$ and
$\Delta_{min} = \Delta_{AS} / n$ (for our experiments we
considered n = 2) are set. Sets of applications not fulfilling
these thresholds will not be investigated by MS.

For generating new subsets, the function NextSubset now follows
an approach similar to the exhaustive search approach ES, but in
the reverse direction, towards smaller subsets (starting with the
set containing all non-frozen applications), and it will consider
only subsets with a smaller total cost than $R_{max}$ and a
larger $\Delta$ than $\Delta_{min}$ (a small $\Delta$ means a
reduced potential to eliminate the cause of the initial failure).

Each time a valid solution is found, the current values of
$R_{max}$ and $\Delta_{min}$ are updated in order to further
restrict the search space. The heuristic stops when no subset can
be found with $\Delta > \Delta_{min}$ or a certain imposed limit
has been reached (e.g., on the total number of attempts to find
new subsets).
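
The pruning performed by SH can be sketched as a filter over candidate subsets. We assume the initial thresholds are the cost of the AS solution and its metric value divided by n. The text does not spell out how the thresholds are updated after a valid solution is found, so the update rule below (new cost as the cost threshold, new metric value divided by n as the metric threshold) is purely our assumption, as are the candidate list, the attempt limit and all names.

```python
def subset_selection(candidates, r_as, delta_as, is_valid, n=2, max_attempts=100):
    """Return (best_subset_name, its_cost) among candidates, or (None, r_as).

    candidates: iterable of (name, cost, delta) tuples in visiting order.
    is_valid:   stands in for a full run of Step 1 of MS on the subset.
    """
    r_max, delta_min = r_as, delta_as / n
    best, attempts = None, 0
    for name, cost, delta in candidates:
        if attempts >= max_attempts:
            break  # imposed limit on the number of attempts
        if cost >= r_max or delta <= delta_min:
            continue  # pruned: too expensive or too little potential
        attempts += 1
        if is_valid(name):
            best = name
            # Assumed update rule; the text only says both values are updated.
            r_max, delta_min = cost, delta / n
    return best, r_max


# Hypothetical candidates: (name, modification cost, metric value).
candidates = [("S1", 200, 50), ("S2", 120, 30), ("S3", 100, 10), ("S4", 90, 40)]
result = subset_selection(candidates, r_as=210, delta_as=40,
                          is_valid=lambda name: name != "S1")
print(result)
```

In this run the third candidate is never evaluated because its metric value falls below the tightened threshold, which is the kind of saving the heuristic relies on.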

5.5 Experimental Evaluation 

In the following three sections we show a series of experiments
that demonstrate the effectiveness of the proposed approaches
and algorithms. The first set of results is related to the
efficiency of our mapping and scheduling algorithm and the
iterative design transformations proposed in Sections 5.4.1 and
5.4.2. The second set of experiments evaluates our heuristics for
minimization of the total modification cost presented in
Section 5.4.3. As a general strategy, we have evaluated our
algorithms by performing experiments on a large number of test
cases generated for experimental purposes. Finally, we have
validated the proposed approaches using a real-life example. All
experiments were run on a SUN Ultra 10 workstation.

5.5.1 IMS and the Iterative Design Transformations

For the evaluation of our approach we used applications of 80,
160, 240, 320 and 400 processes, representing the application
$\Gamma_{current}$, generated for experimental purposes. Thirty
applications were generated for each dimension, thus a total of
150 applications were used for experimental evaluation. We
considered an architecture consisting of ten nodes of different
speeds. For the communication channel we considered a
transmission speed of 256 Kbps and a length below 20 meters. The
maximum length of the data field in a bus slot was 8 bytes.
Throughout the experiments presented in this section we have
considered an existing set of applications $\psi$ consisting of
400 processes, with a schedule table of 6 s on each processor, and
a slack of about 50% of the total schedule size. In this section
we have also considered that no modifications of the existing set
of applications $\psi$ are allowed when implementing a new
application. We will concentrate on the aspects related to the
modification of existing applications in the following section.

The first result concerns the quality of the designs produced
by our initial mapping and scheduling algorithm IMS. As
discussed in Section 5.4.1, IMS uses the MPCP priority function
which considers particularities of the TDMA protocol. In our
experiments we compared the quality of designs (in terms of
schedule length) produced by IMS with those generated with the
original HCP algorithm proposed in [Jor97]. We have calculated
the average percentage deviations of the schedule length
produced with HCP and IMS from the length of the best schedule
among the two. Results are depicted in Figure 5.9. On average, the
deviation from the best result is 3.28 times smaller with IMS than
with HCP. The average execution times for both algorithms are
under half a second for graphs with 400 processes.
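
The comparison metric used here can be sketched as follows: for each test graph, the deviation of one algorithm's schedule length from the better of the two results is expressed as a percentage, and the percentages are averaged. The schedule lengths below are invented for illustration and are not the measured data; the function name is ours.

```python
def average_deviation(lengths_a, lengths_b):
    """Average percentage deviation of A from the best of A and B, per graph."""
    deviations = []
    for a, b in zip(lengths_a, lengths_b):
        best = min(a, b)
        deviations.append(100.0 * (a - best) / best)
    return sum(deviations) / len(deviations)


# Hypothetical schedule lengths in ms for three test graphs.
ims_lengths = [100, 110, 95]
hcp_lengths = [105, 110, 100]

print(average_deviation(ims_lengths, hcp_lengths))
print(average_deviation(hcp_lengths, ims_lengths))
```

An algorithm that always produces the shorter schedule has an average deviation of zero under this metric.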

For the next set of experiments we were interested to investi- 
gate the quality of the design transformation heuristics dis- 
cussed in Section 5.4.2, aiming at the optimization of the 
objective function C. In order to compare this heuristic, imple- 
mented in our mapping and scheduling approach MS, we have 
developed two additional heuristics: 

1. A Simulated Annealing strategy (SA) (see Appendix A), based 
on the same moves as described in Section 5.4.2. SA is ap- 
plied to the solution produced by IMS and aims at finding the 
near-optimal mapping and schedule that minimizes the ob- 
jective function C. The main drawback of the SA strategy is 
that in order to find the near-optimal solution it needs very 
large computation times. Such a strategy, although useful for 
the final stages of the system synthesis, cannot be used in- 
side a design space exploration cycle. 

2. A so-called Ad-hoc Mapping approach (AM), which is a simple,
straightforward solution to produce designs that, to a certain
degree, support an incremental design process. Starting
from the initial valid schedule of length S obtained by IMS for
a graph G with N processes, AM uses a simple scheme to redistribute
the processes inside the [0, D] interval, where D is
the deadline of process graph G. AM starts by considering the
first process in topological order, let it be P1. It introduces after
P1 a slack of size max(smallest process size, (D - S)/N),
thus shifting all descendants of P1 to the right (towards
the end of the schedule table). The insertion of slacks
is repeated for the next process, with the current, larger value
of S, as long as the resulting schedule has a length S ≤ D.
Processes are moved only as long as their individual deadlines
(if any) are not violated.
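
The slack insertion performed by AM can be sketched as follows. This is only an illustration under strong simplifying assumptions that are not part of the text: the processes form a single chain (so every later process is a descendant of every earlier one), and the dictionary keys, the function name and the parameter smallest_size are hypothetical.

```python
def ad_hoc_slack(processes, D, smallest_size):
    """Insert slack after each process of a chain, in the spirit of AM.

    processes: dicts with 'start', 'wcet' and an optional 'deadline',
    listed in topological order. Assumption: the processes form one chain,
    so all later processes are descendants of the current one.
    """
    procs = [dict(p) for p in processes]      # work on a copy
    n = len(procs)
    for k in range(n - 1):                    # nothing to shift after the last one
        S = max(p["start"] + p["wcet"] for p in procs)  # current schedule length
        slack = max(smallest_size, (D - S) / n)
        if S + slack > D:                     # schedule would leave [0, D]
            break
        later = procs[k + 1:]
        violates = any(
            p.get("deadline") is not None
            and p["start"] + slack + p["wcet"] > p["deadline"]
            for p in later
        )
        if violates:                          # keep individual deadlines intact
            continue
        for p in later:
            p["start"] += slack
    return procs


chain = [{"start": 0, "wcet": 10}, {"start": 10, "wcet": 10}, {"start": 20, "wcet": 10}]
shifted = ad_hoc_slack(chain, D=66, smallest_size=5)
print([p["start"] for p in shifted])
```

With D = 66 the first slack is (66 - 30)/3 = 12 and the second, computed with the larger S = 42, is (66 - 42)/3 = 8, which shows how the slack shrinks as the schedule approaches the deadline.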
Our heuristic (MS), proposed in Section 5.4.2, as well as SA and
AM have been used to map and schedule each of the 150 process
graphs on the target system. For each of the resulting designs,
the objective function C has been computed. Very long and
expensive runs have been performed with the SA algorithm for
each graph, and the best solution ever produced has been considered
as the near-optimum for that graph. We have compared the
objective function obtained for the 150 process graphs considering
each of the three heuristics. Figure 5.10a presents the average
percentage deviation of the objective function obtained with
MS and AM from the value of the objective function obtained
with the near-optimal scheme (SA). We have excluded from the
results in Figure 5.10a 37 solutions obtained with AM for which
the second design criterion has not been met, and thus the objective
function has been strongly penalized. The average run-times
of the algorithms are presented in Figure 5.10b. The SA
approach performs best in terms of quality at the expense of a
large execution time: the execution time can be up to 45 minutes
for large graphs of 400 processes. The important aspect is
that MS performs very well, and is able to obtain good quality
solutions, very close to those produced with SA, in a very short
time. AM is, of course, very fast, but since it does not address
explicitly the two design criteria presented in Section 5.3 it has
the worst quality of solutions, as expressed by the objective function.

The most important aspect of the experiments is determining
to which extent the design transformations proposed by us, and
the related heuristic, really facilitate the implementation of
future applications. To find this out, we have mapped applications
of 80, 160, 240 and 320 nodes representing the Γ_current
application on top of ψ (the same ψ as defined for the previous
set of experiments). After mapping and scheduling each of these
graphs we have tried to add a new application Γ_future to the
resulting system. Γ_future consists of a process graph of 80 processes,
randomly generated according to the following specifications:
S_t = {20, 50, 100, 150, 200 ms}, f_St(S_t) = {0.1, 0.25, 0.45, 0.15,
0.05}, S_b = {2, 4, 6, 8 bytes}, f_Sb(S_b) = {0.2, 0.5, 0.2, 0.1},
T_min = 250 ms, t_need = 100 and b_need = 20 ms.

The experiments have been performed two times, using first
MS and then AM for mapping Γ_current. In both cases we were interested
if it is possible to find a correct implementation for Γ_future
on top of the resulting system using the initial mapping and
scheduling algorithm IMS (without any modification of ψ or
Γ_current). Figure 5.11 shows the percentage of successful
implementations of Γ_future in the two cases. In the case Γ_current
has been implemented with MS, that is, using the design criteria
and metrics proposed in this chapter, we were able to find a valid
schedule for 65% of the total cases. However, using AM to map
Γ_current has led to a situation where IMS is able to find correct
solutions in only 21% of the cases. Another conclusion from
Figure 5.11 is that when the total slack available is large, as in the
case Γ_current has only 80 processes, it is easy for MS and, to a
certain extent, even for AM to find a mapping that allows adding
future applications. However, as Γ_current grows to 240 processes,
only MS is able to find an implementation of Γ_current that supports
an incremental design process, accommodating the future
application in more than 60% of the cases. If the remaining slack is
very small, after we map a Γ_current of 320 processes, it becomes
practically impossible to map new applications without modifying
the current system.

Figure 5.11: Percentage of Future Applications
Successfully Implemented

5.5.2 Modification Cost Minimization Heuristics 

For this set of experiments we first used the same 150 applications
as in the previous section, consisting of 80, 160, 240, 320
and 400 processes, for the application Γ_current. We also considered
the same system architecture as presented there.

The first results concern the quality of the solution obtained
with our mapping strategy MS using the search heuristic SH
compared to the case when the simple greedy approach AS and
the exhaustive search ES are used. For the existing applications
we have generated five different sets ψ consisting of different
numbers of applications and processes, as follows: 6 applications
(320 processes), 8 applications (400 processes), 10 applications
(480 processes), 12 applications (560 processes), 14 applications
(640 processes). Each application had an associated modification
cost, assigned manually, in the range 10 to 100. The available
slack is about 50% of the total schedule size. The dependencies
between applications (in the sense introduced in
Section 2.3.2) were such that the total number of possible subsets
Ω resulting for each set ψ was 32, 128, 256, 1024, and 4096,
respectively. We have considered that the future applications,
Γ_future, are characterized by the following parameters: S_t = {20,
50, 100, 150, 200 ms}, f_St(S_t) = {0.1, 0.25, 0.45, 0.15, 0.05},
S_b = {2, 4, 6, 8 bytes}, f_Sb(S_b) = {0.2, 0.5, 0.2, 0.1}, T_min = 250 ms,
t_need = 100 ms and b_need = 20 ms.

MS has been used to produce a valid solution for each of the
150 applications representing Γ_current on each of the target
configurations ψ, using the ES, AS and SH approaches to subset
selection. Figure 5.12a compares the three approaches based on the
total modification cost needed in order to obtain a valid solution.
The exhaustive approach ES is able to obtain valid solutions
with an optimal (smallest) modification cost, while the greedy
approach AS produces on average 3.12 times more costly
modifications in order to obtain valid solutions. However, in order to
find the optimal solution, ES needs large computation times, as
shown in Figure 5.12b. For example, it can take more than 2
hours on average to find the smallest cost subset to be remapped
that leads to a valid solution in the case of 14 applications (640
processes). We can see that the proposed heuristic SH performs
well, producing close to optimal results with a good scaling for
large application sets. For the results in Figure 5.12 we have
eliminated those situations in which no valid solution could be
produced by MS.

Figure 5.12: Evaluation of the Modification Cost
Minimization Heuristics
a) Modification cost obtained with the AS, SH, and ES heuristics
b) Execution times

Finally, we have repeated the last set of experiments discussed
in the previous section (the experiments leading to the
results in Figure 5.11). However, in this case, we have allowed
the current system (consisting of ψ ∪ Γ_current) to be modified
when implementing Γ_future. If the mapping and scheduling
heuristic is allowed to modify the existing system then we are able
to increase the total number of successful attempts to implement
application Γ_future from 65% to 77.5%. For the case with
Γ_current consisting of 160 processes (when the amount of available
resources for Γ_future is small) the increase is from 60% to 92%.

Such an increase is, of course, expected. The important aspect,
however, is that it is obtained not by randomly selecting old
applications to be modified, but by performing this selection
such that the total modification cost is minimized.

5.5.3 The Vehicle Cruise Controller 

As a real-life case study, we have considered the cruise controller
(CC) presented in Section 2.3.3 in order to evaluate our
approaches. For the cruise controller we have used:

• the un-mapped model of the cruise controller presented in
Figure 2.9 on page 40, with 32 processes and two conditions,

• the hardware architecture in Figure 2.7a on page 37, consist- 
ing of five nodes interconnected using a bus implementing 
the time-triggered protocol, 

• and the software architecture for time-driven systems intro- 
duced in Section 3.3. 

• We have considered a transmission speed of the communica- 
tion channel of 256 Kbps and the frequency of the TTP con- 
troller was chosen to be 20 MHz. 

• The period of the CC was chosen to be 300 ms, equal to the 
deadline. 

The system ψ, representing the applications already running
on the four nodes, has been modeled as a set of 80 processes with
a schedule table of 300 ms and leaving a total of 40% slack. The
CC is the Γ_current application to be implemented. We have also
generated 30 future applications of 40 processes each, with the
general characteristics close to those of the CC, which are typical
for automotive applications. We have first mapped and scheduled
the CC on top of ψ using the ad-hoc strategy (AM) and then
our MS algorithm. On the resulting systems, consisting of ψ ∪ CC,
we tried to implement each of the 30 future applications. First,
we considered a situation in which no modifications of the existing
system are allowed when implementing the future applications.
In this case, we were able to implement 21 of the 30 future
applications after implementing the CC with MS, while using AM
to implement the CC, only 4 of the future applications could be
mapped. When modifications of the current system were
allowed, using MS, we were able to map 24 of the 30 future
applications on top of the CC.

As our experiments have shown, the design criteria proposed 
in this chapter are able to guide our mapping and scheduling 
approaches to implementations which support an incremental 
design process. This means that the modifications performed to 
the existing applications are minimized, and that new function- 
ality, later to be added, can be easily accommodated. 

The next part of the book will address the mapping and scheduling
in the context of event-driven systems.

Chapter 6
Schedulability Analysis and
Bus Access Optimization for
Event-Driven Systems



In the previous part of the book we have addressed the 
issue of non-preemptive static process scheduling and communi- 
cation synthesis using the TTP as the communication infrastruc- 
ture. 

In the third part of the book, consisting of this and the next 
chapter, we consider event-driven distributed real-time systems 
where the activation of processes is event-triggered, while the 
communications are time-triggered, according to the TTP. 

This chapter is structured as follows. The next section presents
background and related work in the area of schedulability
analysis. In Section 6.2 we go into some details concerning the
particular schedulability analysis technique that is used as a
starting point in our later discussions. Section 6.3 presents the
schedulability analysis we have developed for systems with both
control and data dependencies modeled as a set of conditional
process graphs. Section 6.4 shows how the current state-of-the-art
schedulability analysis for distributed real-time systems can
be extended to consider the time-triggered protocol. Once realistic
communication aspects are captured by the schedulability
analysis, they can be used to drive the communication synthesis
process described in Section 6.7. Finally, Section 6.8 presents
the experimental results obtained for the approaches presented
in this chapter.



6.1 Background 

Preemptive scheduling of independent processes with static pri- 
orities running on single-processor architectures has its roots in 
the work of Liu and Layland [Liu73]. The approach has been 
later extended to accommodate more general computational 
models and has also been applied to distributed systems 
[Tin94a] . The reader is referred to [Aud95] , [Bal98] , [Sta93] for 
surveys on this topic. 

In [Yen97] performance estimation is based on a preemptive
scheduling strategy with static priorities using rate monotonic
analysis. In [Lee99] an earliest deadline first strategy is used for
non-preemptive scheduling of processes with possible data
dependencies. Preemptive and non-preemptive static scheduling
are combined in the co-synthesis environment described in
[Dav98], [Dav99].

In many of the previous scheduling approaches researchers 
have assumed that processes are scheduled independently. How- 
ever, this is not the case in reality, where process sets can exhibit 
both data and control dependencies. Moreover, knowledge about 
these dependencies can be used in order to improve the accuracy 
of schedulability analyses and the quality of the produced sched- 
ules. 

One way of dealing with data dependencies between processes
with static priority based scheduling has been indirectly
addressed by the extensions proposed for the schedulability
analysis of distributed systems through the use of the release
jitter [Tin94a]. Release jitter is the worst case delay between the
arrival of a process and its release (when it is placed in the
ready-queue for the processor) and can include the communication
delay due to the transmission of a message on the communication
channel.

In [Tin94b] and [Yen98] time offset relationships and phases,
respectively, are used in order to model data dependencies. Offset
and phase are similar concepts that express the existence of
a fixed interval in time between the arrivals of sets of processes.
The authors show that by introducing such concepts into the
computational model, the pessimism of the analysis is significantly
reduced when bounding the time behavior of the system.
The concept of dynamic offsets has been later introduced in
[Pal98] and used to model data dependencies [Pal99].

When control dependencies exist then, depending on condi- 
tions, only a subset of the set of processes is executed during an 
invocation of the system. Modes have been used to model a cer- 
tain class of control dependencies [Foh93]. Such a model basi- 
cally assumes that at the starting of an execution cycle, a 
particular functionality is known in advance and is fixed for one 
or several cycles until another mode change is performed. How- 
ever, modes cannot handle fine grained control dependencies, or 
certain combinations of data and control dependencies. Careful 
modeling using the periods of processes (lower bound between 
subsequent re-arrivals of a process) can also be a solution for 
some cases of control dependencies [Ger96]. If, for example, we 
know that a certain set of processes will only execute every sec- 
ond cycle of the system, we can set their periods to the double of 
the period of the rest of the processes in the system. However, 
using the worst case assumption on periods leads very often to 
unnecessarily pessimistic schedulability evaluations. More 
refined process models can produce much better schedulability 
results, as will be shown later in the book. Recent works
[Bar98a], [Bar98b] aim at extending the existing models to handle
control dependencies. In [Bar98b] Baruah introduces the
recurring real-time task model that is able to capture lower level
control dependencies, and presents an exponential-time analysis
for uniprocessor systems.

As mentioned in Section 4.1, researchers have initially 
ignored communication aspects when analyzing real-time sys- 
tems. However, we have to mention here some results obtained 
in extending real-time schedulability analysis so that network 
communication aspects can be handled. In [Tin95], for example, 
the CAN protocol is investigated while the work reported in 
[Erm97] considers systems based on the ATM protocol. Analysis 
for a simple time-division multiple access (TDMA) protocol is pro- 
vided in [Tin94a] that integrates processor and communication 
schedulability and provides a “holistic” schedulability analysis 
in the context of distributed real-time systems. Other protocols 
have also been considered, like the Token Ring [Str89], and the 
FDDI network architecture [Agr94] . 

The problem of how to allocate priorities to a set of distributed 
processes is discussed in [Gut95]. Their priority assignment 
heuristic is based on the schedulability analysis from [Tin94a] . 

In this third part of the book we consider the time-triggered
protocol (TTP) [Kop03] as the communication infrastructure for a
distributed real-time system. However, the research presented
is also valid for any other TDMA-based bus protocol that schedules
the messages statically based on a schedule table like, for
example, the SAFEbus [Hoy92] protocol used in the avionics
industry.



6.2 Response Time Analysis 

In this part of the book, we consider that processes are scheduled
according to a fixed-priority preemptive scheduling policy
(FPS). This is the most widely used preemptive scheduling
approach, whereby each process has a fixed (static) priority
which is computed off-line. The processes ready for execution
are then executed according to their priority.

6.2.1 Basic Concepts 

The aim of a schedulability analysis is to determine sufficient 
and necessary conditions under which an application is schedu- 
lable. An application is schedulable if there exists at least one 
scheduling algorithm that is able to produce a feasible schedule. 
A schedule is feasible if all processes can be completed within 
the specified constraints. 

There are basically two approaches to the schedulability analysis
in the context of fixed-priority preemptive scheduling:
utilization-based tests, and response-time analysis. The utilization
tests for FPS [Liu73], [Bin01], [Leh89] are not exact (i.e., are only
sufficient, but not both necessary and sufficient), and/or are not
applicable to a more general process model, as we will introduce
below.

Thus, in this book we will use a response time analysis 
[Aud91] approach in order to check the exact feasibility of a set 
of processes. The approach has two steps: 

1. In the first step, the analysis derives the worst-case response
time of each process (the time it takes from the moment it is
ready for execution, until it has finished executing).

2. The second step compares the worst-case response time of
each process to its deadline and, if the response times are
smaller than or equal to the deadlines, the system is schedulable.

Before going into the details of the response time analysis, let
us present the basic concepts we will use, illustrated also in
Figure 6.1 (in addition to those introduced in Section 2.3.1 for
the application model):

• Arrival time, a_i: the time when a process P_i becomes ready
for execution. Also known as request time or release time.

• Start time: the time when a process starts its execution.

• Finishing time: the time when a process finishes its execution.

• Response time, r_i: the time it takes from the arrival of the
process P_i, until it finishes executing.

• Interference, I_i: the time a process P_i is interrupted by higher
priority processes during its execution.

• Blocking time, B_i: the time a process P_i has to wait for lower
priority processes that are in their critical section and cannot
be interrupted.

• Release jitter, J_i: the delay between the arrival of process P_i
and the start of its execution.

• Offset, O_i: the earliest possible arrival time of process P_i,
relative to the start of the schedule (also known as phase).

• Transmission delay, C_m: the time it takes for a message m to
reach the destination controller, once it has been sent on the
bus (also known as propagation delay).

• Queuing delay, W_m: the delay experienced by m at the
communication controller, from the time it was produced by the
sender process, until it is sent.

• Communication delay, r_m: the time it takes for a message
m to reach the destination process, from the moment it has
been produced by the sender process. It is also known as the
end-to-end communication delay, or response time (similar,
conceptually, to the response time of a process).

6.2.2 Response Time Analysis Overview 

This section presents an overview of the response time analysis
used in this book. We start with the basic response time analysis,
as outlined in [Aud91]. In the next sections, we extend this
analysis for applications with data and control dependencies,
implemented on distributed architectures.

Figure 6.1: Illustration of Schedulability Concepts

Figure 6.2 presents an overview of the schedulability analysis
techniques proposed in this chapter (the analyses in the grey
boxes are our contribution). Basically, there are two approaches
to extending the schedulability analysis. The first category,
presented in Sections 6.2.4 and 6.3, has focused on reducing the
pessimism of the analysis by using the information related to
the data and control dependencies, respectively. Sections 6.4 and
6.5 constitute the second category, which has extended the analysis
to handle distributed architectures, and the particularities
of the TTP and CAN protocols.

The analyses presented are structured as follows: 

• Section 6.2.3 presents, as mentioned, the basic response time 
analysis for calculating the worst-case response time of a 
process Pj. This analysis does not take into account the data 
and control dependencies that can exist between processes, 
and it is applicable only to uni-processor systems. 

• Section 6.2.4 extends the previous analysis using the information
about data dependencies, captured by the offsets, in
order to reduce the pessimism¹ of the analysis. Together with
the analysis, in Section 6.2.4 we also present an algorithm
(DelayEstimate in Figure 6.3) that derives values for offsets
such that the schedulability of the application is improved.

• Section 6.3 considers conditional process graphs, which capture
not only the dataflow but also the flow of control. The
analysis in Section 6.2 is extended to take into account
the information related to the conditions, captured by a CPG,
with the aim of reducing the pessimism of the analysis.

• Section 6.4 further extends the analysis to consider applications
mapped on distributed architectures. It does this by
considering that the release jitter J_i of a receiver process P_i
depends on the communication delay of the incoming message
m (also called response time of message m). In addition,
in Section 6.4 we show how the details of a
communication protocol have to be considered when determining
the communication delay of a message. In particular,
we present the extensions for a simple TDMA protocol and for
the CAN bus. For each protocol, we calculate differently the
transmission delay C_m and the worst-case queuing delay W_m
of a message m, needed to determine the communication
delay, as expressed by Equation 6.6 on page 161.

• Section 6.5 presents the analysis we have developed for the
time-triggered protocol. It builds on, and extends, the analysis
for a simple TDMA protocol presented in Section 6.4. Four
approaches to the scheduling of event-triggered messages
over the static TTP bus are presented, with their corresponding
analysis for the communication delay (implying deriving,
for each case, C_m and W_m).

1. An analysis A is less pessimistic than an analysis B if it indicates that
an application, considered by B not to be schedulable, is, in fact,
schedulable.

Figure 6.2: Overview of the Schedulability Analysis Approaches



6.2.3 Basic Response Time Analysis 

As mentioned earlier, in order to find out if an application is 
schedulable, a response time analysis determines the worst-case 
response time of each process, and then compares it to its dead- 
line. If all response times are smaller than or equal to the dead- 
lines, then the application is schedulable. 

Thus, the response time analysis in [Aud91] uses the following
equation for determining the worst-case response time r_i of a
process P_i:

$$r_i = C_i + \sum_{\forall P_j \in hp(P_i)} \left\lceil \frac{r_i}{T_j} \right\rceil C_j \qquad (6.1)$$

where C_i is the worst-case execution time of process P_i, T_j is the
period of process P_j, and hp(P_i) denotes the set of processes that
have a priority higher than the priority of P_i.

The summation term, representing the interference I_i of
higher priority processes on P_i, increases monotonically in r_i,
thus solutions can be found using a recurrence relation. Moreover,
the recurrence relations that calculate the worst-case
response time are guaranteed to converge if the processor
utilization is under 100%.
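
The recurrence for Equation (6.1) can be sketched as a fixed-point iteration. The code below is a minimal illustration, assuming integer time units and that the processes are listed in decreasing priority order, so that hp(P_i) is simply the set of processes with a smaller index. The function name and the iteration limit are hypothetical; the three-process example is a common textbook set, not data from this book.

```python
def response_time(i, C, T, limit=10**6):
    """Worst-case response time of process i according to Equation (6.1).

    C[j] and T[j] are the worst-case execution time and period of process j.
    Assumption: index order equals decreasing priority, so hp(i) = 0..i-1.
    Returns None if no fixed point is found below `limit`.
    """
    r = C[i]                                   # start the recurrence at C_i
    while r <= limit:
        # -(-a // b) is the integer ceiling of a / b
        nxt = C[i] + sum(-(-r // T[j]) * C[j] for j in range(i))
        if nxt == r:                           # fixed point reached
            return r
        r = nxt
    return None


C = [3, 3, 5]
T = [7, 12, 20]
times = [response_time(i, C, T) for i in range(len(C))]
print(times)
```

Comparing each computed value with the corresponding deadline gives the second step of the analysis: the set is schedulable if every response time is smaller than or equal to its deadline.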

The previously presented analysis assumes that the deadline
of a process is smaller than or equal to its period. This assumption
has later been relaxed [Tin94a] to consider arbitrary deadlines
(i.e., deadlines can be larger than the period) and release jitter
(the delay between the arrival of a process and the start of its
execution). Thus, the worst-case response time r_i of a process P_i
becomes:

$$r_i = \max_{q = 0, 1, 2, \ldots} \left( J_i + w_i(q) - q T_i \right) \qquad (6.2)$$

where q is the number of busy periods being examined, and w_i(q)
is the width of the level-i busy period starting at time qT_i. The
level-i busy period is defined as the maximum time a processor
executes processes of priority greater than or equal to the priority
of process P_i, and is calculated as [Tin94a]:

$$w_i(q) = (q + 1) C_i + B_i + \sum_{\forall P_j \in hp(P_i)} \left\lceil \frac{w_i(q) + J_j}{T_j} \right\rceil C_j \qquad (6.3)$$
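
Equations (6.2) and (6.3) can be evaluated with two nested iterations, as in the following sketch. It assumes integer time units and index order equal to decreasing priority. The stopping rule over q (stop at the first q for which w_i(q) ≤ (q + 1) T_i) is the usual one in busy-period analysis and is an assumption here, since the text does not state it; all function names are hypothetical.

```python
def busy_period(q, i, C, T, J, B, limit=10**6):
    """Solve Equation (6.3) for w_i(q) by fixed-point iteration."""
    base = (q + 1) * C[i] + B[i]
    w = base
    while w <= limit:
        # -(-a // b) is the integer ceiling of a / b
        nxt = base + sum(-(-(w + J[j]) // T[j]) * C[j] for j in range(i))
        if nxt == w:
            return w
        w = nxt
    return None                                # did not converge below limit


def response_time_arbitrary(i, C, T, J, B):
    """Equation (6.2): maximum of J_i + w_i(q) - q*T_i over q = 0, 1, 2, ..."""
    best, q = 0, 0
    while True:
        w = busy_period(q, i, C, T, J, B)
        if w is None:
            return None
        best = max(best, J[i] + w - q * T[i])
        if w <= (q + 1) * T[i]:                # assumed stopping rule: busy period ends
            return best
        q += 1


# Two processes, no jitter and no blocking; the second one may overrun its period.
C, T = [26, 62], [70, 100]
J, B = [0, 0], [0, 0]
r_low = response_time_arbitrary(1, C, T, J, B)
print(r_low)
```

In this example the maximum is not reached for q = 0, which is why several busy periods have to be examined when deadlines are allowed to exceed the periods.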

The schedulability analyses presented in the rest of the book 
work under the following assumptions: 



• All the processes belonging to a process graph G have the same
period T_G. For processes with different periods, a hypergraph
is constructed having the LCM of the periods of the processes
involved. However, process graphs can have different periods.

• The offsets (see next section) are static (as opposed to 
dynamic [Pal98]), and are smaller than the period. 

• The deadlines are arbitrary and can be larger than the peri- 
ods. 
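
The period of the merged graph mentioned in the first assumption is the least common multiple of the individual periods. The sketch below computes it, together with the number of activations each period contributes inside one such interval; the latter is an illustrative assumption about how the merged graph would be populated, and the function names are hypothetical.

```python
from functools import reduce
from math import gcd


def hyperperiod(periods):
    """Least common multiple of a non-empty list of integer periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def activations(periods):
    """Number of activations of each period within one hyperperiod."""
    h = hyperperiod(periods)
    return {t: h // t for t in periods}


periods_ms = [20, 50, 100]
print(hyperperiod(periods_ms), activations(periods_ms))
```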
6.2.4 Schedulability Analysis with Data Dependencies



The pessimism of the previous analysis can be reduced by using 
the information related to the precedence relations between pro- 
cesses. The basic idea is to exclude certain worst case scenarios, 
from the critical instant analysis, which are impossible due to 
precedence constraints. 

Methods for schedulability analysis of data dependent processes
with static priority preemptive scheduling have been proposed
in [Yen98], [Tin94b], [Pal98], [Pal99]. They use the
concept of offset (or phase), in order to handle data dependencies.
[Tin94b] shows that the pessimism of the analysis is reduced
through the introduction of offsets. The offsets have to be
determined by the designer.

In their analysis [Tin94b], the response time of a process P_i is:

$$r_i = \max_{q = 0, 1, 2, \ldots} \left[ \max_{\forall P_j \in G} \left( w_i(q) + O_j + J_j - T_G \left( q + \left\lceil \frac{O_j + J_j - O_i - J_i}{T_G} \right\rceil \right) - O_i \right) \right] \qquad (6.4)$$

where T_G is the period of the process graph G, O_i and O_j are offsets
of processes P_i and P_j, respectively, and J_i and J_j are the release
jitters of P_i and P_j. In Equation (6.4), the level-i busy period
starting at time qT_G [Tin94b] is:

$$w_i(q) = (q + 1) C_i + B_i + I_i \qquad (6.5)$$

In the previous equation, the blocking term B_i represents
interference from lower priority processes that are in their critical
section and cannot be interrupted, and C_i represents the
worst-case execution time of process P_i. The last term captures
the interference I_i from higher priority processes in the application,
including higher priority processes from other process
graphs. The reader is directed to [Tin94b] for the details of the
interference calculation.

Although this analysis is exact (both necessary and sufficient),
it is computationally infeasible to evaluate. Hence,
[Tin94b] proposes a feasible¹ but not exact analysis (sufficient
but not necessary) for solving Equation (6.4). Our implementations
use the feasible analysis provided in [Tin94b] for deriving
the worst-case response time of a process P_i.

The authors in [Yen98] provide a framework that iteratively 
finds the phases (offsets) for all processes, and then feeds them 
back into the schedulability analysis which in turn is used again 
to derive better phases. Thus, the pessimism of the analysis is 
iteratively reduced. 

Response Time Analysis Algorithm 

In [Yen98] an application is modeled as a set S of n process 
graphs Xj, i = 1, 2, ..., n. The application model assumed and the 
definition of a process graph is similar to our CPG, but without 
considering any conditions. The aim of the schedulability 
analysis in [Yen98] is to derive an as tight as possible worst case 
delay on the execution time of each of the process graphs in the 
application. This delay estimation is done using the algorithm 
DelayEstimate described in Figure 6.3. 

At the core of this algorithm is a worst case response time cal- 
culation based on offsets, similar to the analysis in [Tin94b]. 
Thus, in the LatestTimes function (called in line 8 of 
DelayEstimate), worst-case response times and upper bounds for 
the offsets are calculated, while the EarliestTimes function (line 9) 
calculates the lower bounds of the offsets. 

The LatestTimes function is a modified critical-path algorithm
that calculates for each node of the graph the longest path to the
sink node (see Section 4.2.2 for the definition of the critical
path). Hence, during the topological traversal of the graph x
within LatestTimes, for each process P_i, the worst-case response
time r_i is calculated according to Equation 6.4. This value is
based on the values of the offsets known so far. Once an r_i is
calculated, it can be used to determine and update offsets for other
successor processes. Accordingly, the EarliestTimes function
determines the lower bounds on the offsets. The influence on graph x
from other graphs in the application is considered in both of the
functions mentioned earlier.

1. The implementation of this feasible analysis is available at:
ftp://ftp.cs.york.ac.uk/pub/realtime/programs/src/offsets/

DelayEstimate(process graph x, application S)
 1  -- derives the worst case delay of a process graph x considering
 2  -- the influence from all other process graphs in the application S
 3  for each pair (P_i, P_j) in x do
 4      maxsep[P_i, P_j] = ∞
 5  end for
 6
 7  repeat
 8      LatestTimes(x)
 9      EarliestTimes(x)
10      for each P_i ∈ x do
11          MaxSeparations(P_i)
12      end for
13  until maxsep is not changed or limit reached
14  return the worst case delay δ_x of the graph x
end DelayEstimate

SchedulabilityTest(application S)
 1  -- derives the worst case delay for each process graph in the system
 2  -- and verifies if the deadlines are met
 3  for each process graph x_i ∈ S do
 4      DelayEstimate(x_i, S)
 5  end for
 6  if all process graphs meet their deadline then application S is schedulable
end SchedulabilityTest

These calculations can be improved by realizing that for a process
P_i, there might exist a process P_j mapped on the same processor,
with priority(P_i) < priority(P_j), such that their execution
windows never overlap. In this case, the interference of P_j on the
execution of P_i can be dropped, resulting in a tighter worst-case
response time calculation. This situation is expressed through
the so-called maxsep table, computed by the MaxSeparations function,
whose value maxsep[P_i, P_j] is less than or equal to 0 if the two
processes never overlap during their execution (lines 10-12 of
DelayEstimate). The term maxsep stands for maximum separation,
an analysis modified from [Mc92] which builds the maxsep table
based on the worst case execution times and offsets determined
in EarliestTimes and LatestTimes.

With a better view of the maximum separation between each pair
of processes, tighter worst-case response times and offsets can
be derived, which in turn contribute to the update of the maxsep
table. This iterative tightening process is repeated until there
is no modification to the maxsep table, or a certain imposed limit
on the number of iterations is reached (line 13).

Finally, the DelayEstimate function returns the worst-case
delay δ_x estimated for a process graph x, as the time when the
sink node of x finishes its execution (line 14). Based on the
delays produced by DelayEstimate, the function SchedulabilityTest
in Figure 6.3 concludes on the schedulability of the application.



6.3 Schedulability Analysis under Control and 
Data Dependencies 

In the previous sections we extended the basic schedulability
analysis to handle data dependencies.

In this section, we further extend the analysis to handle not
only data but also control dependencies. This means developing a
schedulability analysis for an application modeled as a set of
conditional process graphs.

Example 6.1: To show the relevance of our problem, let us
consider the example depicted in Figure 6.4, where we have an
application modeled as two conditional process graphs G1 and G2
with a total of 9 processes (processes P0, P8, P9 and P12 are
dummy processes and are not counted), and one condition. The
processes are mapped on three different processors as indicated
by the shading, and the worst-case execution time in milliseconds
for each process on its respective processor is depicted to the
left of each node. G1 has a period of 200 ms, G2 has a period of
150 ms. The deadlines are 100 ms on G1 and 90 ms on G2.

Table 6.1 presents the worst-case delays of the two graphs. In
the column labelled “no conditions” we have the results for the
case when the analysis is applied to the set of processes,
ignoring control dependencies. This results in a worst-case delay
of 120 ms for G1 and 82 ms for G2. Hence, the application is
considered not to be schedulable. This analysis assumes as a
worst-case scenario the possible activation of all nine processes
during each execution of the application. This is the solution
which will be obtained using a dataflow graph representation of
the application.

Table 6.1: Worst-Case Delays for the Application in Figure 6.4

        Worst-Case Delays (ms)
CPG     no conditions    conditions
G1      120              100
G2      82               82

However, considering the CPG G1 in Figure 6.4, it is easy to
observe that process P3 on the one side and processes P2 and P4
on the other side will not be activated during the same period of
G1. Making use of this information for the analysis we obtain a
worst-case delay of 100 ms for G1, as shown in Table 6.1 in the
column headed “conditions,” which indicates that the application
is, in fact, schedulable.

■ 

Section 2.3.1 has presented the conditional process graph
representation. Before introducing our schedulability analysis
for CPGs, we revisit two concepts: the unconditional subgraphs
and the process guards.

Depending on the values calculated for the conditions, different
alternative paths through a conditional process graph are
activated for a given activation of the application. To model
this, a logical expression $X_{P_i}$, called guard (introduced in
Section 2.3.1), can be associated with each node Pi in the graph.
It represents the necessary condition for the respective process
to be activated.
Example 6.2: In Figure 6.5, for example, Xp^ = C ∧ D,

Xp^ = C, Xp^ = true, Xp^^ = true, and Xp^^ = K.

■ 

We call an alternative path through a conditional process
graph, resulting from a combination of conditions, an
unconditional subgraph, denoted by g.

Example 6.3: The CPG in Figure 6.5 has three unconditional
subgraphs, corresponding to the following three combinations
of conditions: C ∧ D, C ∧ ¬D, and ¬C. The
unconditional subgraph corresponding to the combination 
C A D in the CPG Gj consists of processes P^, P 2 , P 4 , Pq, Pj, Pg 
and PiQ. 

■ 

The guards of each process, as well as the unconditional sub- 
graphs resulting from a conditional process graph G, can be 
determined through a simple recursive topological traversal of G. 
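The traversal can be sketched in Python as follows. This is a simplified illustration, not the procedure used in this book: it assumes that the nodes are given in topological order, that an edge carries either no condition or a single (condition, value) pair, and that a node is active as soon as one of its incoming edges is taken. The encoding, the function name and the small example graph are hypothetical.

```python
def unconditional_subgraphs(nodes, edges):
    """Enumerate (condition values, active nodes) for every alternative path.

    nodes: node names in topological order.
    edges: (src, dst, cond) triples; cond is None or (name, bool).
    """
    incoming = {n: [] for n in nodes}
    for src, dst, cond in edges:
        incoming[dst].append((src, cond))
    results = []

    def explore(index, active, values):
        if index == len(nodes):
            results.append((values, [n for n in nodes if n in active]))
            return
        node = nodes[index]
        preds = incoming[node]
        if not preds:  # a source node is always active
            explore(index + 1, active | {node}, values)
            return
        # Conditions computed by active predecessors and not yet decided.
        undecided = sorted({c[0] for s, c in preds
                            if s in active and c is not None
                            and c[0] not in values})
        if undecided:  # branch on both values of one condition
            for v in (True, False):
                explore(index, active, {**values, undecided[0]: v})
            return
        taken = any(s in active and (c is None or values[c[0]] == c[1])
                    for s, c in preds)
        explore(index + 1, (active | {node}) if taken else active, values)

    explore(0, frozenset(), {})
    return results


# Hypothetical CPG with two conditions, C and D (D is computed only if C holds)
nodes = ["P0", "P1", "P2", "P3", "P4", "P5", "P6"]
edges = [("P0", "P1", None),
         ("P1", "P2", ("C", True)), ("P1", "P3", ("C", False)),
         ("P2", "P4", ("D", True)), ("P2", "P5", ("D", False)),
         ("P3", "P6", None), ("P4", "P6", None), ("P5", "P6", None)]
subgraphs = unconditional_subgraphs(nodes, edges)
```

For this example the traversal yields three unconditional subgraphs, for C ∧ D, C ∧ ¬D and ¬C. The guard of a process can be read off the result as the set of condition combinations under which the process appears.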




Figure 6.5: Example of Two CPGs 
In the following sections we present four approaches to the
analysis of conditional process graphs. There are two extreme
solutions to this problem:

• The first one, called Ignoring Conditions (IC), ignores control
dependencies and applies the schedulability analysis for the
(unconditional) process graphs.

• At the other end, the Brute Force Algorithm (BF) applies the
schedulability analysis after each of the CPGs in the application
has been decomposed into its constituent unconditional subgraphs.

The other two solutions proposed are in-between solutions:

• Conditions Separation (CS) is similar to Ignoring Conditions,
but uses the knowledge about the conditions in order to update
the maxsep table: maxsep[Pi, Pj] = 0 if processes Pi and Pj are
on different conditional paths (see Section 6.2.4, Figure 6.3).

• Relaxed Tightness Analysis (with two variants: RT1, RT2) is
similar to the Brute Force Algorithm, but tries to reduce the
execution time by removing the iterative tightening loop (hence
the name relaxed tightness) in the DelayEstimate function in
Figure 6.3.

6.3.1 Ignoring Conditions (IC) 

A straightforward approach to the schedulability analysis of 
applications represented as CPGs is to ignore control dependen- 
cies and to apply the schedulability analysis as described in 
Section 6.2.4 (the algorithm SchedulabilityTest in Figure 6.3). 

This means that the conditional edges in the CPGs are treated as
simple edges and the conditions in the model are dropped (line 3
of the algorithm in Figure 6.6). What results is an application S
consisting of simple process graphs xi, each one derived from a
CPG Gi of the given application Γ. The application S can then be
analyzed (line 4) using the algorithm in Figure 6.3. Clearly, if
the application S is schedulable, the application Γ is also
schedulable (line 5).

This approach, which we call IC, is, of course, very pessimistic. 
However, this is the current practice when worst-case arrival 
periods are considered and classical data flow graphs are used 
for modeling and scheduling [Yen98], [Tin94b]. 

6.3.2 Brute Force Solution (BF) 

The pessimism of the IC approach can be reduced by considering 
the conditions captured by a conditional process graph model. A 
simple, brute force solution is to apply the schedulability analy- 
sis presented in Section 6.2.4, after the CPGs have been decom- 
posed into their constituent unconditional subgraphs. 

Consider an application Γ which consists of n CPGs Gi, i = 1,
2, ..., n. Each CPG Gi can be decomposed into $n_i$ unconditional
subgraphs $g_i^j$, j = 1, 2, ..., $n_i$. In Figure 6.5, for example,
we have three unconditional subgraphs, $g_1^1$, $g_1^2$, $g_1^3$,
derived from G1 and two, $g_2^1$, $g_2^2$, derived from G2.

At the same time, each CPG Gi can be transformed into a simple
process graph xi, by transforming conditional edges into ordinary
ones and dropping the conditions. When deriving the worst-case
delay of Gi we apply the analysis from Section 6.2.4 (algorithm
DelayEstimate in Figure 6.3) separately to each unconditional
subgraph $g_i^j$ in combination with the graphs
($x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$). This means that
we consider each alternative path from Gi in the context of the
application, instead of the whole graph xi as in the previous
approach. This is described by the algorithm DE/CPG in
Figure 6.7a. The schedulability analysis is then based on the
delay estimation for each CPG, as shown in the algorithm SA/BF in
Figure 6.7b.

SA/IC(application Γ)
 1  -- verifies the schedulability of a system consisting of a set of
 2  -- conditional process graphs
 3  transform each Gi ∈ Γ into the corresponding xi ∈ S
 4  SchedulabilityTest(S)
 5  if S is schedulable then application Γ is schedulable
end SA/IC

Figure 6.6: Schedulability Analysis Ignoring Conditions

Such an approach, which we call BF, produces tight bounds on the
delays but can be expensive from the runtime point of view,
because the analysis is applied to each unconditional subgraph.
In general, the number of unconditional subgraphs can grow
exponentially. However, for many practical systems this is not
the case, and the brute force method can be used. Alternatively,
less expensive methods, like those presented next, can be applied.

DE/CPG(CPG G, application S)
 1  -- derives the worst-case delay of a CPG G considering
 2  -- the influence from all other process graphs in the application S
 3  extract all unconditional subgraphs g^j from G
 4  for each g^j ∈ G do
 5      DelayEstimate(g^j, S)
 6  end for
 7  return the largest of the delays, which is
    the worst-case delay δ_G of CPG G
end DE/CPG

a) DE/CPG: Delay estimation for conditional process graphs

SA/BF(application Γ)
 1  -- verifies the schedulability of a system consisting of a set Γ of
 2  -- conditional process graphs
 3  transform each Gi ∈ Γ into the corresponding xi ∈ S
 4  for each Gi ∈ Γ do
 5      DE/CPG(Gi, {x1, x2, ..., x(i-1), x(i+1), ..., xn})
 6  end for
 7  if all CPGs meet their deadline then the application Γ is schedulable
end SA/BF

b) SA/BF: Schedulability analysis: the brute force approach

Figure 6.7: Brute Force Schedulability Analysis
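The structure of the brute force approach can be sketched in Python as follows. The sketch only shows the control flow of DE/CPG and SA/BF; the delay analysis itself is passed in as a function, and the one used here is a toy placeholder that looks up made-up delays. The data layout and all names are assumptions made for this illustration.

```python
def de_cpg(subgraphs, others, delay_estimate):
    """Worst-case delay of a CPG: the largest delay over its unconditional
    subgraphs, each analyzed together with the graphs in `others`."""
    return max(delay_estimate(g, others) for g in subgraphs)


def sa_bf(cpgs, delay_estimate):
    """cpgs maps a CPG name to (subgraphs, simple_graph, deadline).
    Returns True if every CPG meets its deadline."""
    for name, (subgraphs, _, deadline) in cpgs.items():
        # The other CPGs take part with their condition-free graphs.
        others = [simple for n, (_, simple, _) in cpgs.items() if n != name]
        if de_cpg(subgraphs, others, delay_estimate) > deadline:
            return False
    return True


# Toy placeholder for DelayEstimate: made-up delays per subgraph.
toy_delays = {"g1a": 70, "g1b": 95, "g2a": 82}


def toy_delay_estimate(subgraph, others):
    return toy_delays[subgraph]


cpgs = {"G1": (["g1a", "g1b"], "x1", 100),
        "G2": (["g2a"], "x2", 90)}
schedulable = sa_bf(cpgs, toy_delay_estimate)
```

The cost of the approach is visible in `de_cpg`: the delay analysis is invoked once per unconditional subgraph, so the runtime grows with the number of subgraphs.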

6.3.3 Condition Separation (CS) 

In some situations, the explosion of unconditional subgraphs
makes the brute force method inapplicable. Hence, we need to find
an analysis that is situated somewhere between the two
alternatives IC and BF, which means it should not be too
pessimistic and should run in acceptable time.

A first idea is to go back to the DelayEstimate algorithm in
Figure 6.3, and use the knowledge about conditions in order to
update the maxsep table. If two processes Pi and Pj never overlap
their execution because they execute under alternative values of
conditions, then we can update maxsep[Pi, Pj] to 0, and thus
improve the quality of the delay estimation. Two processes Pi and
Pj never overlap their execution if there exists at least one
condition C, so that $C \subset X_{P_i}$ ($X_{P_i}$ is the guard of
process Pi) and $\overline{C} \subset X_{P_j}$ (lines 19-23 in
Figure 6.8).

In this approach, called CS, we practically use the same algo- 
rithm as for ordinary process graphs and try to exploit the infor- 
mation captured by conditional dependencies in order to exclude 
certain influences during the analysis. In Figure 6.8 we show 
the algorithm SA/CS which performs the schedulability analysis 
based on this heuristic. 
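The guard test used by this heuristic can be sketched in a few lines of Python. The sketch assumes that each guard is a plain conjunction, encoded as a dictionary from condition name to required value; guards containing disjunctions would need a richer encoding. The function names and the example guards are hypothetical.

```python
import math


def mutually_exclusive(guard_i, guard_j):
    """True if some condition is required with opposite values."""
    return any(cond in guard_j and guard_j[cond] != value
               for cond, value in guard_i.items())


def apply_condition_separation(maxsep, guards):
    """Set maxsep[Pi, Pj] = 0 for processes on alternative paths."""
    for pi in guards:
        for pj in guards:
            if pi != pj and mutually_exclusive(guards[pi], guards[pj]):
                maxsep[(pi, pj)] = 0
    return maxsep


# Hypothetical guards: P2 runs if C, P3 if not C, P4 if C and D, P6 always.
guards = {"P2": {"C": True},
          "P3": {"C": False},
          "P4": {"C": True, "D": True},
          "P6": {}}
maxsep = {(a, b): math.inf for a in guards for b in guards if a != b}
apply_condition_separation(maxsep, guards)
```

Only the pairs whose guards contradict each other are reset to 0; all other entries keep the value produced by the regular analysis (here the initial value of infinity).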

6.3.4 Relaxed Tightness Analysis (RT) 

The two alternatives of the RT approach discussed here are
similar to the brute force algorithm in Figure 6.7. However, they
try to improve on the execution time of the analysis by reducing
the complexity of the DelayEstimate algorithm (Figure 6.3), which
is called from the DE/CPG function in line 5 (Figure 6.7a). This
reduces the execution time of the analysis not by reducing the
number of subgraphs which have to be visited (as in the CS
approach), but by reducing the time needed to analyze each
subgraph.

SA/CS(application Γ)
 1  -- verifies the schedulability of a system consisting of a set Γ of
 2  -- conditional process graphs
 3  transform each Gi ∈ Γ into the corresponding xi ∈ S
 4      and keep guard X_Pi for each Pi
 5  for each xi ∈ S do
 6      -- derives the worst-case delay of a process graph xi
 7      -- considering the influence from all other process graphs
 8      -- in the system S
 9      for each pair (Pi, Pj) in xi do
10          maxsep[Pi, Pj] = ∞
11      end for
12
13      repeat
14          LatestTimes(xi)
15          EarliestTimes(xi)
16          for each Pi ∈ xi do
17              MaxSeparations(Pi)
18          end for
19          for each pair (Pi, Pj) in xi do
20              if ∃C, C ⊂ X_Pi ∧ ¬C ⊂ X_Pj then
21                  maxsep[Pi, Pj] = 0
22              end if
23          end for
24      until maxsep is not changed or limit reached
25      δ_Gi is the worst-case delay for Gi
26  end for
27  if all CPGs meet their deadline then the application Γ is schedulable
end SA/CS

Figure 6.8: Schedulability Analysis using Condition Separation

As our experimental results in Section 6.8 show, this approach
can be very effective in practice. Of course, because of the
simplification applied to DelayEstimate, the quality of the
analysis is reduced in comparison to the brute force method.

DelayEstimateRT1(process graph x, application S)
 1  LatestTimes(x)
end DelayEstimateRT1

a) Delay estimation for RT1

DelayEstimateRT2(process graph x, application S)
 1  for each pair (Pi, Pj) in x do
 2      maxsep[Pi, Pj] = ∞
 3  end for
 4  LatestTimes(x)
 5  EarliestTimes(x)
 6  for each Pi ∈ x do
 7      MaxSeparations(Pi)
 8  end for
 9  LatestTimes(x)
end DelayEstimateRT2

b) Delay estimation for RT2

Figure 6.9: Delay Estimation for the RT Approaches

We have considered two alternatives, of which the first one is
more drastic while the second one tries a more refined trade-off
between execution time and quality of the analysis.

With both these approaches, the idea is not to run the iterative
tightening loop in DelayEstimate that repeats until no changes
are made to maxsep or until the limit is reached (lines 7-13 in
Figure 6.3). While this tightening loop iteratively reduces the
pessimism when calculating the worst-case response times, the
actual calculation of the worst-case response times is done in
LatestTimes, and the rest of the algorithm in Figure 6.3 just
tries to improve on these values. For the first approach, called
RT1, the function DelayEstimate has been transformed as in
Figure 6.9a.

However, it might be worth using at least the MaxSeparations
function in order to obtain tighter values for the worst-case
response times. For the alternative RT2 in Figure 6.9b,
DelayEstimateRT2 first calls LatestTimes and EarliestTimes, then
MaxSeparations in order to build the maxsep table, and again
LatestTimes to tighten the worst-case response times (lines 4-9
in Figure 6.9b).
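The difference between the full analysis and the two relaxed variants lies only in how often the three building blocks are called, which the following Python sketch makes explicit. It is a structural illustration under assumptions: the building blocks are passed in as functions, `latest` is assumed to return the current delay estimate, and the `toy_*` functions are placeholders that merely mimic an analysis whose result tightens as the maxsep table fills up.

```python
def delay_estimate(graph, latest, earliest, max_separations, limit=10):
    """Full analysis: repeat until maxsep no longer changes."""
    maxsep = {}
    delay = None
    for _ in range(limit):
        before = dict(maxsep)
        delay = latest(graph, maxsep)
        earliest(graph, maxsep)
        max_separations(graph, maxsep)
        if maxsep == before:
            break
    return delay


def delay_estimate_rt1(graph, latest, earliest, max_separations):
    """RT1: a single call to LatestTimes, no tightening at all."""
    return latest(graph, {})


def delay_estimate_rt2(graph, latest, earliest, max_separations):
    """RT2: build maxsep once, then call LatestTimes a second time."""
    maxsep = {}
    latest(graph, maxsep)
    earliest(graph, maxsep)
    max_separations(graph, maxsep)
    return latest(graph, maxsep)


# Toy placeholders: each known separation removes 10 time units of pessimism.
def toy_latest(graph, maxsep):
    return 100 - 10 * len(maxsep)


def toy_earliest(graph, maxsep):
    return None


def toy_max_separations(graph, maxsep):
    if len(maxsep) < 3:          # one new separation found per call
        maxsep[len(maxsep)] = 0


steps = (toy_latest, toy_earliest, toy_max_separations)
full = delay_estimate(None, *steps)
rt1 = delay_estimate_rt1(None, *steps)
rt2 = delay_estimate_rt2(None, *steps)
```

With these placeholders the estimates are ordered as one would expect, RT1 being the most pessimistic and the full loop the tightest, while RT1 and RT2 make one and two calls to `latest` respectively.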



6.4 Schedulability Analysis for Distributed 
Systems 

The previous sections have shown how we can reduce the pessi- 
mism of the analysis by using information related to the data 
and control dependencies. In this section, we present an exten- 
sion of the response time analysis to handle applications distrib- 
uted on multi-processor architectures. 

Tindell et al. [Tin94a] integrate processor and communication
scheduling and provide a “holistic” schedulability analysis in the
context of distributed real-time systems. The validity of this
analysis was later confirmed in [Pal97].

In the case of a distributed system, the response time of a
process also depends on the communication delay due to messages.
In [Tin94a] the analysis for messages is done in a similar way as
for processes: a message is seen as a non-preemptable process that
is “running” on a bus. Thus, the same analysis can be applied to
messages on a bus, rewriting Equation 6.1 for the response time
$r_m$ of a message m (Equation 6.6) in terms of three quantities.
$J_m$ is the jitter of message m, which in the worst case is equal
to the response time $r_{S(m)}$ of the sender process $P_{S(m)}$.
$C_m$ is the worst-case time it takes for message m to reach the
destination controller. $W_m$ is the worst-case queuing delay
experienced by m at the communication controller; it is calculated
(Equation 6.7) from q, the number of busy periods being examined,
and $w_m(q)$, the width of the level-m busy period starting at
time $qT_m$.

The response time analyses for processes and messages are
combined by realizing that the jitter of a destination process
depends on the communication delay between sending and receiving
a message. Thus, for a process $P_{D(m)}$ that receives a message
m from a sender process $P_{S(m)}$, the release jitter is:

$$ J_{D(m)} = r_m \tag{6.8} $$



For the communication infrastructure of our heterogeneous
architectures, we use in this book the controller area network
(CAN) and the time-triggered protocol (TTP). In order to analyze
systems implemented with these two protocols, we have to provide
analyses that bound the worst-case queuing delay $W_m$ and the
worst-case transmission delay $C_m$ for a message m. Tindell et al.
[Tin95] provide an analysis for the CAN protocol, while in
[Tin94a] an analysis for a simple TDMA protocol, sharing
similarities with TTP, is presented. We briefly present these
analyses, and later we will show how they can be extended to suit
our particular settings.

6.4.1 Schedulability Analysis for the CAN Protocol

Tindell et al. [Tin95] provide worst-case bounds for $w_m$ and
$C_m$ in the context of the CAN protocol. CAN is a priority bus,
where the message having the highest priority on the network gets
to be transmitted (see Section 3.2.2).

In Figure 3.8 on page 60, node N2 is part of a CAN network and
has a CAN controller. Messages waiting to become the highest
priority on the network wait for their transmission in an outgoing
queue, denoted in the figure with $Out_{N_2}$. Thus, the worst-case
queuing delay for a message m is determined according to
Equation (6.7), where the level-m busy period $w_m(q)$ is:

$$ w_m(q) = B_m + \sum_{\forall m_j \in hp(m)} \left\lceil \frac{w_m(q) + r_{S(m_j)}}{T_j} \right\rceil C_j. \tag{6.9} $$



The intuition is that m has to wait, in the worst case, first for
the largest lower priority message that is just being transmitted
($B_m$) as well as for the higher priority messages
$m_j \in hp(m)$ that have to be transmitted ahead of m (the second
term). In the worst case, the time it takes for the largest lower
priority message $m_k \in lp(m)$ to be transmitted to its
destination is:

$$ B_m = \max_{\forall m_k \in lp(m)} (C_k). \tag{6.10} $$



Once m is sent, the time $C_m$ it takes to transmit it to the
destination controller depends on the frame configuration
introduced in Section 3.2.2, the message size $s_m$ and the time
$\tau_{bit}$ it takes to transmit a bit [Tin95]:

$$ C_m = \left( \left\lfloor \frac{34 + 8 s_m}{5} \right\rfloor + 47 + 8 s_m \right) \tau_{bit}. \tag{6.11} $$
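The CAN bounds can be evaluated numerically with a small fixed-point iteration. The following Python sketch is an illustration under stated assumptions, not the analysis tool used in this book: it assumes that the frame length is bounded by floor((34 + 8s)/5) + 47 + 8s bit times for a payload of s bytes, that the busy period is the blocking term plus the interference of the higher-priority messages, and it examines only the first busy period (q = 0). The dictionary layout, the field names (`jitter` stands for the worst-case response time of the sender) and the example message set are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def can_frame_time(size_bytes, tau_bit):
    """Assumed bound on C_m: worst-case frame length times the bit time."""
    return ((34 + 8 * size_bytes) // 5 + 47 + 8 * size_bytes) * tau_bit


def can_busy_period(name, messages, tau_bit, max_iter=1000):
    """Iterate w = B + sum(ceil((w + J_j) / T_j) * C_j) to a fixed point."""
    c = {k: can_frame_time(v["size"], tau_bit) for k, v in messages.items()}
    prio = messages[name]["prio"]  # smaller number = higher priority
    higher = [k for k, v in messages.items() if v["prio"] < prio]
    lower = [k for k, v in messages.items() if v["prio"] > prio]
    # Blocking by the largest lower-priority frame already on the bus.
    blocking = max((c[k] for k in lower), default=0)
    w = blocking
    for _ in range(max_iter):
        new_w = blocking + sum(
            ceil_div(w + messages[k]["jitter"], messages[k]["period"]) * c[k]
            for k in higher)
        if new_w == w:
            return w
        w = new_w
    raise RuntimeError("busy period did not converge")


# Hypothetical message set; times are expressed in bit times (tau_bit = 1).
messages = {
    "m1": {"prio": 1, "size": 8, "period": 1000, "jitter": 10},
    "m2": {"prio": 2, "size": 2, "period": 2000, "jitter": 0},
    "m3": {"prio": 3, "size": 4, "period": 2000, "jitter": 0},
}
w_m2 = can_busy_period("m2", messages, tau_bit=1)
```

In this example m2 is blocked by the frame of m3 and then has to let one instance of m1 pass. Integer arithmetic is used throughout so that the fixed-point test `new_w == w` is exact.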



6.4.2 Schedulability Analysis for a TDMA Bus

In this part of the book we consider that, although the messages
are produced based on events, they are transmitted using the
time-triggered protocol. TTP has a time-division multiple access
scheme to the bus, meaning that a message produced on node Ni can
be transmitted only during a predetermined time interval, the
slot Si corresponding to node Ni.

Tindell et al. [Tin94a] have developed an analysis for a simple
TDMA bus that shares similarities with the TTP. In their setting,
messages are split into packets before being sent.

In this case, the transmission of a message can span several
rounds. The analysis in [Tin94a] determines the transmission
delay $C_m$ based on the number of packets that need to be taken
from the queue over the time $w_m(q)$ in order to guarantee the
transmission of the last packet of m. Thus, the propagation delay
depends on the number of busy periods q being examined.

The transmission delay $C_m$ of a message m sent as packets over
a slot S is given by Equation (6.12), where $S_p$ is the size of
slot S in number of packets and $S_s$ is the size of the slot in
number of bits.

In the case when messages cannot be split into packets and thus
the transmission of a message is done in one single round, the
propagation delay $C_m$ is equal to the size of the slot in which
message m is being transmitted, and thus is not influenced by the
number of busy periods being examined.

The details of message transmission in such a setting are pre- 
sented in Section 3.4. When a message is produced by a sender 
process, all its packets are placed in the Out queue (Figure 3.6 
on page 56). Packets are ordered according to their priority. At 
its activation, the message transfer process takes a certain num- 
ber of packets from the head of the Out queue and constructs a 
frame. The number of packets accepted is decided so that their 
total size does not exceed the length of the data field of the 
frame. This length is limited by the size of the slot corresponding 
to the respective processor. Since the messages are produced 
dynamically, they have to be identified in a certain way so that 
they are recognized when the frame arrives at the delivery pro- 
cess. Thus, each message has several identifier bits appended at 
the beginning of the message. 

Since the packets are dynamically packed into frames in the 
order they are sorted in the queue, the worst-case queuing delay 
to the communication channel for a packet p depends on the 
number of packets queued ahead of it. 
The analysis in [Tin94a] bounds the number of queued-ahead
packets of messages of higher priority than a message m, which
gives the following busy period:

$$ w_m(q) = \left\lceil \frac{(q+1)\,p_m + I_m(w_m(q))}{S_p} \right\rceil T_{TDMA}, \tag{6.13} $$

where $p_m$ is the number of packets of message m, $S_p$ is the
size of the slot (in number of packets) corresponding to m, and

$$ I_m(w_m(q)) = \sum_{\forall m_j \in hp(m)} \left\lceil \frac{w_m(q) + r_{S(m_j)}}{T_j} \right\rceil p_j, \tag{6.14} $$

where $p_j$ is the number of packets of a message $m_j$.

The analysis assumes that the period $T_m$ of any message m is
longer than or equal to the length of a TDMA round,
$T_m \geq T_{TDMA}$ (see Figure 3.2 on page 48 and Figure 6.10 on
page 167).



6.5 Schedulability Analysis for the Time 
Triggered Protocol 

Although there are many similarities with the general TDMA 
protocol, the analysis in the case of TTP is different in several 
aspects from the one outlined in the previous sections, and also 
differs to a large degree depending on the policy chosen for mes- 
sage scheduling. 

The four approaches we propose for scheduling messages using
TTP differ in the way the messages are allocated to the
communication channel (either statically or dynamically), and in
whether they are split into packets for transmission or not. The
next sections present the analysis for each approach as well as
the degrees of freedom a designer has, in each of the cases, for
optimizing the MEDL (the static schedule table for messages, see
Section 3.2.1).
Before going into details for each of the message scheduling
approaches we propose, we have to mention how we account, on each
processor, for the overheads due to transmission.

The overhead produced by the communication activities must be
accounted for not only as part of the response time for a
message, but also through its influence on the response time of
processes running on the same processor. We consider this
influence during the schedulability analysis of processes on each
processor. We assume that the worst-case computation time of the
transfer process (T in Figure 3.6 on page 56) is known, and that
it is different for each of the four message scheduling
approaches. Based on the respective MEDL, the transfer process is
activated for each frame sent. Its worst-case period is derived
from the minimum time between successive frames.

The response time of the delivery process (D in Figure 3.6) is
considered as part of the communication delay. The influence due
to the delivery process must also be included when analyzing the
response time of the processes running on the respective
processor. We consider the delivery process during the
schedulability analysis in the same way as the message transfer
process.

6.5.1 Static Single Message Allocation (SM) 

The first approach to scheduling of messages using TTP is to
statically (off-line) schedule each of the messages into a slot of
the TDMA cycle, corresponding to the node sending the message.
This means that for each message we decide off-line to allocate
space in one or more frames, space that can only be used by that
particular message. In Figure 6.10 the frames are denoted by
rectangles. In this particular example, it has been decided to
allocate space for message m in slot S1 of the first and third
rounds.

Since the messages are dynamically produced by the processes,
the exact moment a certain message is generated cannot be
predicted. Hence, it can happen that certain frames will be left
empty during execution. For example, if there is no message m in
the Out queue (see Figure 3.6) when the slot S1 of the first round
in Figure 6.10 starts, that frame will carry no information. A
message m produced immediately after slot S1 starts to be
transmitted could then be carried by the frame scheduled in the
slot S1 of the third round.

In the SM approach, we consider that each slot can hold at most
one message. This approach is well suited for application areas,
like safety-critical automotive electronics, where
the messages are typically short and the ability to easily diag- 
nose the system (fewer messages in a frame are easier to 
observe) is critical. In the automotive electronics area messages 
are typically a couple of bytes, encoding signals like vehicle 
speed. However, for applications using larger messages, the SM 
approach leads to overheads due to the inefficient utilization of 
slot space when transmitting smaller size messages. 

As each slot carries only one fixed, predetermined message,
there is no interference among messages. If a message m misses
its allocated frame it has to wait for the following slot assigned
to m. The worst-case queuing delay for a message m in this
approach depends on the maximum time between consecutive slots of
the same node carrying the message m. We denote this time by
$\theta_m$, illustrated in Figure 6.10, where we have a system
cycle consisting of three TDMA rounds. In this case, considering
Equation (6.6), the worst-case response time $r_m$ of a message m
is obtained by using $\theta_m$ as the worst-case queuing delay,
alongside $J_m$ and $C_m$ (Equation 6.15).

Figure 6.10: Worst-Case Arrival Time for SM

Therefore, the main aspect influencing schedulability of the
messages is the way they are statically allocated to slots, which
determines the values of $\theta_m$. $\theta_m$, as well as $C_m$,
depends on the slot sizes which, in the case of SM, are determined
by the size of the largest message sent from the corresponding
node plus the bits for control and CRC, as imposed by the protocol.

As mentioned before, the analysis in [Tin94a], done for a simple
TDMA protocol, assumes that $T_m \geq T_{TDMA}$. In the case of
static message allocation with TTP (the SM and MM approaches),
this translates to the condition $T_m \geq \theta_m$.

During the synthesis of the MEDL, the designer has to allocate
the messages to slots in such a way that the process set is
schedulable. Since the schedulability of the process set can be
influenced by the synthesis of the MEDL only through the
$\theta_m$ parameters, these are the parameters which have to be
optimized.

Example 6.4: Let us consider the simple example depicted in
Figure 6.11, where we have three processes, P1, P2, and P3, each
running on a different processor. When process P1 finishes
executing it sends message m1 to process P3 and message m2 to
process P2. In the TDMA configurations presented in Figure 6.11,
only the slot corresponding to the CPU running P1 is important
for our discussion and the other slots are represented with light
gray. With the configuration in Figure 6.11a, where the message
m1 is allocated to the rounds one and four and the message m2 is
allocated to rounds two and three, process P2 misses its deadline
because of the release jitter due to the message m2 in Round 2.
However, if we have the TDMA configuration depicted in
Figure 6.11b, where m1 is allocated to the rounds two and four and
m2 is allocated to the rounds one and three, all the processes
meet their deadlines.

■ 
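The effect of the allocation on the maximum time between consecutive slots carrying a message can be checked with a short Python sketch. It rests on assumptions that are not taken from Figure 6.11: the schedule is assumed to repeat after four rounds, every round is assumed to have the same length, and the sending node is assumed to own one slot per round. The function name and the round length of 10 time units are hypothetical.

```python
def max_slot_distance(rounds_with_m, rounds_per_cycle, t_tdma):
    """Largest time between consecutive slots that carry the message,
    for a schedule that repeats every `rounds_per_cycle` rounds."""
    rounds = sorted(rounds_with_m)
    # Pair each round with the next one, wrapping around the cycle.
    gaps = [(nxt - cur) % rounds_per_cycle or rounds_per_cycle
            for cur, nxt in zip(rounds, rounds[1:] + rounds[:1])]
    return max(gaps) * t_tdma


# Allocation as in Figure 6.11a: m1 in rounds 1 and 4, m2 in rounds 2 and 3.
a_m1 = max_slot_distance([1, 4], rounds_per_cycle=4, t_tdma=10)
a_m2 = max_slot_distance([2, 3], rounds_per_cycle=4, t_tdma=10)
# Allocation as in Figure 6.11b: m1 in rounds 2 and 4, m2 in rounds 1 and 3.
b_m1 = max_slot_distance([2, 4], rounds_per_cycle=4, t_tdma=10)
b_m2 = max_slot_distance([1, 3], rounds_per_cycle=4, t_tdma=10)
```

Under these assumptions the evenly spaced allocation gives a distance of two rounds for both messages, whereas the first allocation leaves a gap of three rounds for each of them.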

6.5.2 Static Multiple Message Allocation (MM) 

This second approach is an extension of the first one. In this
approach we allow more than one message to be statically assigned
to a slot, and all the messages transmitted in the same slot are
packed together in a frame. As with the SM approach, there is no
interference among messages, so the worst-case access delay for a
message m is the maximum time between consecutive slots of the
same node carrying the message m, $\theta_m$. It is also assumed
that $T_m \geq \theta_m$.

However, this approach offers more freedom during the synthesis
of the MEDL. We now also have to decide how many and which
messages should be packed in a slot. This allows more flexibility
in optimizing the $\theta_m$ parameter.

Example 6.5: To illustrate this, let us consider the same
example depicted in Figure 6.11. With the MM approach, the TDMA
configuration can be arranged as depicted in Figure 6.11c, where
the messages m1 and m2 are put together in the same slot in the
first and second rounds. Thus, the deadline is met and the release
jitter is further reduced compared to the case presented in
Figure 6.11b, where process P3 was experiencing a large release
jitter.

■ 

6.5.3 Dynamic Message Allocation (DM) 

The previous two approaches have statically allocated one or 
more messages to their corresponding slots. This third approach 
considers that the messages are dynamically allocated to 
frames, as they are produced. 
Thus, as soon as a message is produced, it is placed in the Out
queue (see Figure 3.6 on page 56), ordered according to the
message priorities. When the transfer process is activated on node
Ni, it removes from the head of the queue a number of messages,
such that the total size does not exceed the length of the data
field of the frame allocated to slot Si. Since the messages are
sent dynamically, we have to identify them in a certain way so
that they are recognized when the frame arrives at the delivery
process. We consider that each message has several identifier
bits appended at the beginning of the message.

We dynamically pack messages into frames in the order they are
sorted in the queue. Thus, the worst-case queuing delay to the
communication channel for a message m depends on the number of
messages queued ahead of it.

The analysis in [Tin94a] for a simple TDMA bus, presented in 
Section 6.4.2, bounds the number of queued-ahead packets of 
messages of higher priority than message m, as in their case it is 
considered that a message can be split into packets before it is 
transmitted on the communication channel (Equation 6.13). We 
use the same analysis but we have to apply it for the number of 
messages instead of packets. We have to consider that messages 
can be of different sizes as opposed to packets which always are 
of the same size. 

Therefore, the total size of higher priority messages queued
ahead of a message m, in the worst case, is:

$$I_m(w_m(q)) = \sum_{\forall m_j \in hp(m)} \left\lceil \frac{w_m(q) + r_{S(m_j)}}{T_j} \right\rceil S_j \qquad (6.16)$$

where $S_j$ is the size of the message $m_j$, $r_{S(m_j)}$ is the response
time of the process sending message $m_j$, and $T_j$ is the period of
the message $m_j$.

Further, we calculate the worst-case time that a message m
spends in the Out queue. The number of TDMA rounds needed, in
the worst case, for a message m placed in the queue to be
removed from the queue for transmission is

$$\left\lceil \frac{(q+1)\,S_m + I_m(w_m(q))}{S_s} \right\rceil \qquad (6.17)$$

where $S_m$ is the size of the message m and $S_s$ is the size of the
slot transmitting m (we assume, in the case of DM, that for any
message x, $S_x \le S_s$). This means that the worst-case time a
message m spends in the Out queue is given by

$$w_m(q) = \left\lceil \frac{(q+1)\,S_m + I_m(w_m(q))}{S_s} \right\rceil T_{TDMA} \qquad (6.18)$$

where $T_{TDMA}$ is the time taken for a TDMA round.
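
Because the queuing delay appears on both sides of Equation 6.18, it has to be solved iteratively. The sketch below assumes the usual fixed-point iteration of response time analysis, starting from zero; the function and parameter names are hypothetical, all sizes and times are plain integers in arbitrary units, and the index q is simply passed through as it appears in the equations.

```python
import math

def queued_ahead_size(w, higher_priority):
    """Equation 6.16: total size of higher-priority messages queued ahead.

    higher_priority is a list of (S_j, r_j, T_j) tuples: message size,
    response time of the sending process, and message period.
    """
    return sum(math.ceil((w + r_j) / t_j) * s_j
               for s_j, r_j, t_j in higher_priority)

def queuing_delay(q, s_m, slot_size, t_tdma, higher_priority, max_iter=1000):
    """Iterate Equation 6.18 until the delay w_m(q) stops changing."""
    w = 0
    for _ in range(max_iter):
        backlog = (q + 1) * s_m + queued_ahead_size(w, higher_priority)
        rounds = math.ceil(backlog / slot_size)   # Equation 6.17
        w_next = rounds * t_tdma
        if w_next == w:
            return w
        w = w_next
    raise RuntimeError("no fixed point found within max_iter iterations")
```

For one 4-unit higher-priority message with period 50, a 4-unit message m needs two rounds when the slot holds 4 units and a single round when the slot holds 8 units, so enlarging the slot halves the delay in this made-up case.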

Since the size of the messages is fixed for a given application, 
the parameter that will be optimized during the synthesis of the 
MEDL is the slot size. 



Example 6.6: To illustrate how the slot size influences
schedulability, let us consider the example in Figure 6.12,
where we have the same setting as for the example in
Figure 6.11. The difference is that we consider message $m_1$
having a higher priority than message $m_2$ and we schedule
the messages dynamically as they are produced.

With the configuration in Figure 6.12a, message $m_1$ will be
dynamically scheduled first in the slot of the first round,
while message $m_2$ will wait in the Out queue until the next
round comes, thus causing process $P_2$ to miss its deadline.
However, if we enlarge the slot so that it can accommodate
both messages, message $m_2$ does not have to wait in the
queue and it is transmitted in the same slot as $m_1$. Therefore,
$P_2$ will meet its deadline, as presented in Figure 6.12b.
However, in general, increasing the length of slots does not
necessarily improve schedulability, as it delays the
communication of messages generated by other nodes.

■ 

6.5.4 Dynamic Packet Allocation (DP) 

This approach is an extension of the previous one, as we allow 
the messages to be split into packets before they are transmitted 
on the communication channel. We consider that each slot has a 
size that accommodates a frame with the data field being a mul- 
tiple of the packet size. The analysis for this case is similar to 
the one outlined in Section 6.4.2 for the simple TDMA bus. 

This approach is well suited for the application areas that typ- 
ically have large message sizes. By splitting messages into pack- 
ets, we can obtain a higher utilization of the bus and reduce the 
release jitter. However, since each packet has to be identified as 
belonging to a message, and messages have to be split at the 
sender and reconstructed at the destination, the overhead 
becomes higher than in the previous approaches. 

In the previous approach (DM) the optimization parameter for 
the synthesis of the MEDL was the size of the slots. With this 
approach we can also decide on the packet size, which becomes 
another optimization parameter. 

Example 6.7: Consider the example in Figure 6.12c, where
messages $m_1$ and $m_2$ have a size of 6 bytes each. The packet
size is considered to be 4 bytes and the slot corresponding to
the messages has a size of 12 bytes (three packets) in the
TDMA configuration. Since message $m_1$ has a higher priority
than $m_2$, it will be dynamically scheduled first in the slot of
the first round and it will need two packets. In the third
packet the first 4 bytes of $m_2$ are placed. Thus, the remaining
2 bytes of message $m_2$ have to wait for the next round,
causing process $P_2$ to miss its deadline. However, if we change the
packet size to 3 bytes and keep the same size of 12 bytes for
the slot, we have four packets in the slot (Figure 6.12d).
Message $m_1$ will be dynamically scheduled first and will need
2 packets in the slot of the first round. Hence, $m_2$ can be sent
in the same round so that $P_2$ can meet its deadline.

■

a) P2 misses its deadline; there is no space in the slot of the first round
to schedule the lower priority message m2

b) All processes meet their deadlines; the slot has been enlarged
to hold both messages

c) P2 misses its deadline; the slot is too small
to hold both packets of message m2

d) All processes meet their deadlines; the slot has been enlarged
to hold 4 packets instead of 3

Legend: release jitter, running process, message, process activation, deadline

Figure 6.12: Optimizing the MEDL for DM and DP

In the above example, with one single sender processor and 
the particular message and slot sizes as given, the problem 
seems to be simple. This is, however, not the case in general. For 
example, the packet size which fits a particular node can be 
unsuitable in the context of the messages and slot size corre- 
sponding to another node. At the same time, reducing the pack- 
ets size increases the overheads due to the transfer and delivery 
processes. 
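
The arithmetic of Example 6.7 can be checked with a few lines of code. This is a minimal sketch, assuming that a packet carries data from one message only (as in the example, where the third packet holds the first 4 bytes of the second message) and ignoring identifier bits; the function name `rounds_needed` is hypothetical.

```python
import math

def rounds_needed(message_sizes, packet_size, slot_size):
    """TDMA rounds needed to send all messages of one node under DP.

    Each message is split into ceil(size / packet_size) packets, and a
    slot carries slot_size // packet_size packets per round.
    """
    packets_per_slot = slot_size // packet_size
    total_packets = sum(math.ceil(size / packet_size) for size in message_sizes)
    return math.ceil(total_packets / packets_per_slot)
```

For two 6-byte messages and a 12-byte slot, 4-byte packets need two rounds, while 3-byte packets fit into a single round, which matches Figures 6.12c and 6.12d.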

6.6 Schedulability Analysis for Event-Driven Systems

The previous sections have built, step by step, a response time
analysis for event-driven systems that is able to handle:

• data dependencies in applications, 

• distributed architectures, 

• the details of several communication protocols, and 

• control dependencies. 

Therefore, at this point we have two analyses (see the boxes 
with a thick border in Figure 6.2): 

1. A response time analysis for applications with data and con- 
trol dependencies distributed over a multi-processor archi- 
tecture that uses TTP as the communication protocol. In 
Section 6.7 we will show how this analysis can be used to 
drive a bus access optimization process for the time-trig- 
gered protocol. 

2. A similar response time analysis, but for CAN. This analysis
will be used in Chapter 8 for the event-triggered cluster of a
multi-cluster system.

6.6.1 Degree of Schedulability 

To determine if an application is schedulable, it is enough to 
compare the response times with the deadlines. However, in 
order to drive our optimization heuristics aiming at obtaining a 
schedulable system at a low cost, we need a metric that is able to 
indicate which of the design alternatives is “more schedulable” 
than the others. 

Thus, in order to guide the optimization process, we have used
as a cost function the “degree of schedulability.” Our cost
function is similar to that in [Tin92] in the case an application is not
schedulable ($c_1$). However, in order to distinguish between
several schedulable applications, we have introduced the second
expression, $c_2$, which measures, for a feasible design alternative,
the total difference between the response times and the
deadlines. We use the following function in order to express the
degree of schedulability:

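$$\text{degree of schedulability} = \begin{cases} c_1 = \sum_{i=1}^{n} \max(0,\, R_i - D_i), & \text{if } c_1 > 0 \\ c_2 = \sum_{i=1}^{n} (R_i - D_i), & \text{if } c_1 = 0 \end{cases}$$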

where n is the number of processes in the application, $R_i$ is the
worst-case response time of a process $P_i$, and $D_i$ is the deadline
of a process $P_i$. If the application is not schedulable, there exists
at least one $R_i$ greater than the deadline $D_i$, therefore the term
$c_1$ of the function will be positive. In this case the cost function is
equal to $c_1$.

However, if the application is schedulable, then each $R_i$ is
smaller than the corresponding deadline $D_i$. In this case $c_1 = 0$
and we use $c_2$ as the cost function, as it is able to differentiate
between two alternatives, both leading to a schedulable
application. For a given design implementation leading to a schedulable
application, a smaller $c_2$ means that we have improved the
response times of the processes, so the application can be
potentially implemented on a cheaper hardware architecture (with
slower processors and/or bus, but without increasing the
number of processors or buses).
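
The two-level cost function described above can be written down directly. The sketch below is only an illustration; the function name and the list-based inputs are assumptions, with response times and deadlines given in the same time unit.

```python
def degree_of_schedulability(response_times, deadlines):
    """Return c1 if any deadline is missed, otherwise c2.

    c1 sums only the deadline overruns, so it is positive exactly when the
    application is not schedulable. c2 sums all differences R_i - D_i and
    is used to rank alternatives that are all schedulable.
    """
    pairs = list(zip(response_times, deadlines))
    c1 = sum(max(0, r - d) for r, d in pairs)
    if c1 > 0:
        return c1
    return sum(r - d for r, d in pairs)
```

With deadlines of 10 for two processes, response times of 5 and 12 give a cost of 2 (one deadline missed by 2), while response times of 5 and 8 give -7; lowering a response time further makes the cost smaller, which is what the optimization heuristics reward.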



6.7 Bus Access Optimization 

Once a schedulability analysis for event-driven distributed
real-time systems is in place, our problem is to analyze the
schedulability of a given process set and to synthesize the MEDL of the TTP
controllers (and consequently the MHTTs) so that the application
is schedulable on an architecture that is as cheap as possible. The
optimization is performed on the parameters which have been
identified for each of the four approaches to message scheduling
discussed before (see Section 6.5). In order to guide the
optimization process, we have used as a cost function the degree of
schedulability calculated for a given MEDL implementation.

For a given application, we are interested in synthesizing a
MEDL such that the degree of schedulability cost function is
minimized. We are also interested in evaluating the four approaches
to message scheduling in different contexts, thus offering the
designer decision support for choosing the approach that best
fits the application.

The MEDL synthesis problem belongs to the class of exponential
complexity problems, therefore we are interested in developing
heuristics that are able to find accurate results in a reasonable
time. We have developed optimization algorithms corresponding
to each of the four approaches to message scheduling. A first set
of algorithms, presented in Section 6.7.1, is based on simple and
fast greedy heuristics. In Section 6.7.2 we introduce a second
class of heuristics, which aims at finding near-optimal solutions
using the simulated annealing (SA) algorithm.

6.7.1 Greedy Heuristics 

We have developed greedy heuristics for each of the four 
approaches to message scheduling. The main idea of the heuris- 
tics is to minimize the cost function by incrementally trying to 
reduce the communication delay of messages and, by this, the 
release jitter of the processes. 

The only way to reduce the release jitter in the SM and MM
approaches is through the optimization of the $\theta_m$ parameters.
This is achieved by a proper placement of messages into slots
(see Figure 6.11).

The OptimizeSM algorithm presented in Figure 6.13 starts by
deciding on a size ($size_{S_i}$) for each of the slots. The slot sizes are
set to the minimum size that can accommodate the largest
message sent by the corresponding node (lines 1-4 in Figure 6.13).
In this approach a slot can carry at most one message, thus slot
sizes larger than this size would lead to larger response times.

Then, the algorithm has to decide on the number of rounds,
thus determining the size of the MEDL. Since the size of the MEDL
is physically limited, there is a limit to the number of rounds
(e.g., 2, 4, 8, 16 depending on the particular TTP controller
implementation). However, there is a minimum number of
rounds, MinRounds, which is necessary for a certain application
and depends on the number of messages transmitted (lines 5-9).
For example, if the processes mapped on a node send in total
seven messages, then we have to decide on at least seven rounds
in order to accommodate all of them (in the SM approach there is
at most one message per slot). Several numbers of rounds,
RoundsNo, are tried out by the algorithm, starting from MinRounds
up to MaxRounds (lines 15-31).
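
The first two steps of the heuristic (lines 1-9 in Figure 6.13) amount to two maxima over the messages of each node. The sketch below is a hypothetical rendering: the function name, the dictionary layout, and the message sizes in the usage example are made up for illustration.

```python
def initial_sm_parameters(messages_by_node):
    """Slot sizes and minimum number of rounds for the SM approach.

    messages_by_node maps a node name to the list of sizes of the
    messages that node sends.
    """
    # Lines 1-4: each slot must fit the largest message of its node.
    slot_sizes = {node: max(sizes) for node, sizes in messages_by_node.items()}
    # Lines 5-9: with one message per slot, the busiest node sets the minimum.
    min_rounds = max(len(sizes) for sizes in messages_by_node.values())
    return slot_sizes, min_rounds
```

A node that sends seven messages therefore forces at least seven rounds, regardless of how few messages the other nodes send.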

For a given number of rounds (which determines the size of the
MEDL), the initially empty MEDL has to be populated with
messages in such a way that the cost function is minimized. In order
to apply the schedulability analysis, which is the basis for the
cost function, a complete MEDL has to be provided. A complete
MEDL contains at least one instance of every message that has to
be transmitted between the processes on different processors. A
minimal complete MEDL is constructed from an empty MEDL by
placing one instance of every message $m_i$ into its corresponding
empty slot of a round (lines 10-14).

OptimizeSM
 1  -- set the slot sizes
 2  for each node N_i do
 3      size_Si = max(size of messages m_j sent by node N_i)
 4  end for
 5  -- find the minimum number of rounds that can hold all the messages
 6  for each node N_i do
 7      nm_i = number of messages sent from N_i
 8  end for
 9  MinRounds = max(nm_i)
10  -- create a minimal complete MEDL
11  for each message m_i do
12      find round in [1..MinRounds] that has an empty slot for m_i
13      place m_i into its slot in round
14  end for
15  for each RoundsNo in [MinRounds..MaxRounds] do
16      -- insert messages in such a way that the cost is minimized
17      repeat
18          for each process P_i that receives a message m_Pi do
19              if D_i - R_i is the smallest so far then m = m_Pi end if
20          end for
21          for each round in [1..RoundsNo] do
22              place m into its corresponding slot in round
23              calculate the CostFunction
24              if the CostFunction is smallest so far then
25                  BestRound = round
26              end if
27              remove m from its slot in round
28          end for
29          place m into its slot in BestRound if one was identified
30      until the CostFunction is not improved
31  end for
end OptimizeSM

Figure 6.13: Greedy Heuristic for SM

Example 6.8: In Figure 6.11a, for example, we have a MEDL
composed of four rounds. We get a minimal complete MEDL,
for example, by assigning $m_2$ and $m_1$ to the slots in rounds
three and four, and leaving the slots in rounds one and two
empty.

■ 

However, such a MEDL might not lead to a schedulable system.
The degree of schedulability can be improved by inserting
instances of messages into the available places in the MEDL, thus
minimizing the $\theta_m$ parameters.

Example 6.9: For example, in Figure 6.11a, inserting
another instance of the message $m_1$ in the first round and $m_2$
in the second round leads to $P_2$ missing its deadline, while in
Figure 6.11b inserting $m_1$ into the second round and $m_2$ into
the first round leads to a schedulable system.

■ 

Our algorithm repeatedly adds a new instance of a message to
the current MEDL in the hope that the cost function will be
improved (lines 16-30). In order to decide which message should
have an instance added to the current MEDL, a simple heuristic
is used. We identify the process which has the most “critical”
situation, meaning that the difference between its deadline and
response time, $D_i - R_i$, is minimal compared with all other
processes. The message to be added to the MEDL is the message
$m = m_{P_i}$ received by the process $P_i$ (lines 18-20). Message m will
be placed into that round (BestRound) which corresponds to the
smallest value of the cost function (lines 21-28). The algorithm
stops if the cost function cannot be further improved by adding
more messages to the MEDL.
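
The insertion loop (lines 17-30 of Figure 6.13) can be sketched for a fixed number of rounds as follows. This is an illustrative reading rather than the authors' implementation: the MEDL is modeled as a dictionary from (round, node) to message, and `cost` and `pick_critical` are hypothetical callables standing in for the schedulability analysis and for the selection of the message received by the most critical process.

```python
def greedy_insert(medl, rounds_no, sender, cost, pick_critical):
    """Add message instances to medl while the cost function improves.

    medl maps (round, node) to a message and is modified in place;
    sender maps a message to the node that sends it.
    Returns the final value of the cost function.
    """
    best_cost = cost(medl)
    while True:
        m = pick_critical(medl)                  # lines 18-20
        node = sender[m]
        best_round = None
        for rnd in range(1, rounds_no + 1):      # lines 21-28
            if (rnd, node) in medl:
                continue                         # SM: one message per slot
            medl[(rnd, node)] = m                # tentative placement
            c = cost(medl)
            if c < best_cost:
                best_cost, best_round = c, rnd
            del medl[(rnd, node)]
        if best_round is None:                   # line 30: no improvement
            return best_cost
        medl[(best_round, node)] = m             # line 29
```

In a real setting `cost` would run the response time analysis and return the degree of schedulability; the sketch only fixes the control flow, so any function of the MEDL can be plugged in for experimentation.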

The OptimizeMM algorithm is similar to OptimizeSM. The main
difference is that in the MM approach several messages can be
placed into a slot (which also decides its size), while in the SM
approach there can be at most one message per slot. Also, in the
case of MM, we have to take additional care that the slots do not
exceed the maximum allowed size for a slot.

The situation is simpler for the dynamic approaches, namely
DM and DP, since we only have to decide on the slot sizes and, in
the case of DP, on the packet size. For these two approaches, the
placement of messages is dynamic and has no influence on the
cost function. The OptimizeDM algorithm (see Figure 6.14) starts
with the first slot $S_i$ of the TDMA round and tries to find the
size ($BestSize_{S_i}$) which corresponds to the smallest CostFunction
(lines 4-14 in Figure 6.14). This slot size has to be large enough
($S_i \ge MinSize_{S_i}$) to hold the largest message to be transmitted in
this slot, and within bounds determined by the particular TTP
controller implementation (e.g., from 2 bits up to MaxSize = 32
bytes). Once the size of the first slot has been determined, the
algorithm continues in the same manner with the next slots
(lines 7-12).



OptimizeDM
 1  for each node N_i do
 2      MinSize_Si = max(size of messages m_j sent by node N_i)
 3  end for
 4  -- identifies the size that minimizes the cost function
 5  for each slot S_i do
 6      BestSize_Si = MinSize_Si
 7      for each SlotSize in [MinSize_Si..MaxSize] do
 8          calculate the CostFunction
 9          if the CostFunction is best so far then
10              BestSize_Si = SlotSize
11          end if
12      end for
13      size_Si = BestSize_Si
14  end for
end OptimizeDM

Figure 6.14: Greedy Heuristic for DM
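
A direct transcription of Figure 6.14 into code might look as follows. It is a sketch under assumptions: `cost` is a hypothetical callable that evaluates the degree of schedulability for a list of slot sizes, sizes are integers in the controller's smallest data unit, and `step` is an added convenience parameter for the size increment.

```python
def optimize_dm(min_sizes, max_size, cost, step=1):
    """Greedy slot-size selection, one slot at a time.

    min_sizes[i] is the size of the largest message sent in slot i.
    Earlier slots keep their chosen size while later slots are tuned.
    """
    sizes = list(min_sizes)
    for i, min_size in enumerate(min_sizes):
        best_size, best_cost = min_size, None
        for candidate in range(min_size, max_size + 1, step):
            sizes[i] = candidate
            c = cost(sizes)
            if best_cost is None or c < best_cost:
                best_cost, best_size = c, candidate
        sizes[i] = best_size
    return sizes
```

Since each slot is fixed before the next one is examined, the result depends on the slot order and is not guaranteed to be the global optimum, which is the usual trade-off of a greedy heuristic.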
The OptimizeDP algorithm also has to determine the proper
packet size. This is done by trying all the possible packet sizes
given the particular TTP controller. For example, it can start
from 2 bits and increment with the “smallest data unit”
(typically two bits) up to 32 bytes. In the case of the OptimizeDP
algorithm the slot size has to be determined as a multiple of the
packet size and within certain bounds depending on the TTP
controller.

6.7.2 Simulated Annealing Strategy 

We have also developed an optimization procedure based on a 
simulated annealing (SA) strategy, described in Appendix A. The 
main characteristic of such a strategy is that it tries to find the 
global optimum by randomly selecting a new solution from the 
neighbors of the current solution. 

The neighbors of the current solution are obtained depending 
on the chosen message scheduling approach. For SM, the next 
solution is obtained from the current one by inserting or remov- 
ing a message in one of its corresponding slots. In the case of MM, 
we have to take additional care that the slots do not exceed the 
maximum allowed size (which depends on the controller imple- 
mentation), as we can allocate several messages to a slot. For 
these two static approaches we also decide on the number of 
rounds in a cycle (e.g., 2, 4, 8, 16, limited by the size of the mem- 
ory implementing the MEDL). In the case of DM, the neighboring 
solution is obtained by increasing or decreasing the slot size 
within the bounds allowed by the particular TTP controller 
implementation, while in the DP approach we also increase or 
decrease the packet size. 

We have also tuned the parameters TI (initial temperature),
TL (temperature length), and ε (cooling ratio) that define the
cooling schedule (see Appendix A for the details on these
parameters). For example, for the graphs with 320 nodes, TI is 300, TL
is 500 and ε is 0.95. The algorithm stops if for three consecutive
temperatures no new solution has been accepted.



6.8 Experimental Evaluation 

In Section 6.8.1, the results for the schedulability analysis of 
systems with data and control dependencies (Section 6.3) are 
presented. Section 6.8.2 first presents the experimental results 
for the schedulability analysis with the TTP (Section 6.4), com- 
paring the four message scheduling approaches. Then, we 
present the results obtained for the communication synthesis 
problem outlined in Section 6.7. The approaches presented in 
this chapter are further evaluated in Section 6.8.3 using the 
vehicle cruise controller model from Section 2.3.3. 

6.8.1 Schedulability Analysis for Systems with Control
and Data Dependencies

In this section we present some experimental results regarding
the schedulability analysis for conditional process graphs, which
has been discussed in Section 6.3. The two main aspects we were
interested in are the quality of the schedulability analysis and
the scalability of the algorithms for large examples. A better
quality of a schedulability analysis, in this case, means a lower
degree of pessimism. An extensive set of experiments was
performed on conditional process graphs generated for experimental
purposes.

We considered architectures consisting of 2, 4, 6, 8 and 10 pro- 
cessors. Forty processes were assigned to each node, resulting in 
applications of 80, 160, 240, 320 and 400 processes, having 2, 4, 
6, 8 and 10 conditions, respectively. The number of uncondi- 
tional subgraphs varied for each application dimension depend- 
ing on the number of conditions and the randomly generated 
structure of the CPGs. For example, for applications with 400
processes, the maximum number of unconditional subgraphs
was 64.

Thirty applications were generated for each dimension, thus a
total of 150 graphs were used for experimental evaluation.
Worst-case execution times were assigned randomly using both
uniform and exponential distributions. The experiments were
run on a SUN Ultra 10 workstation.

In order to compare the quality of the schedulability 
approaches, we need a cost function that captures, for a certain 
system, the difference in quality between the schedulability 
approaches proposed (Section 6.3). Our cost function is the dif- 
ference between the deadline and the worst-case delay of a CPG, 
summed for all the CPGs in the system: 

$$\text{cost function} = \sum_{i=1}^{n} \left( D_{G_i} - \delta_{G_i} \right) \qquad (6.19)$$

where n is the number of CPGs in the application, $\delta_{G_i}$ is the
worst-case delay of the CPG $G_i$, and $D_{G_i}$ is the deadline on $G_i$. A
higher value for this cost function, for a given system, means
that the corresponding approach produces better results (the
schedulability analysis is less pessimistic).

For each of the 150 generated example systems and each of 
the five schedulability analyses we have calculated the cost 
function mentioned previously, based on results produced with 
the algorithms described in Section 6.3. These values, for a 
given system, differ from one analysis to another, with the BF 
being the least pessimistic approach and therefore having the 
largest value for the cost function. 

Figure 6.15: Comparison of the
Schedulability Approaches for CPGs

We are interested in comparing the approaches based on the
values obtained for the cost function (Equation 6.19). Figure 6.15a
presents the average percentage deviations of the cost function
obtained in each of the approaches, compared to the value of the
cost function obtained with the BF approach. A smaller value for
the percentage deviation means a larger cost function, thus a
better result. The percentage deviation is calculated according
to the formula:

$$\text{deviation} = \frac{cost_{best} - cost_{approach}}{cost_{best}} \cdot 100 \qquad (6.20)$$

Figure 6.15b presents the average runtime of the algorithms, 
in seconds. 

The brute force approach, BF, performs best in terms of quality
and obtains the largest values for the cost function at the
expense of a large execution time. The execution time can be up
to 7 minutes for large graphs of 400 processes, 10 conditions, and
64 unconditional subgraphs. At the other end, the straightforward
approach IC, which ignores the conditions, performs worst
and becomes more and more pessimistic as the system size
increases. As can be seen from Figure 6.15, even for
smaller systems of 160 processes (3 conditions, maximum 8
unconditional subgraphs) IC has a 50% worse quality than the brute
force approach, with almost 80% loss in quality, on average, for
large systems of 400 processes. It is interesting to mention that
the low-quality IC approach also has an average execution time
which is equal or comparable to that of the much better quality
heuristics (except BF, of course). This is because it tries to improve on
the worst-case delays through the iterative loop presented in
DelayEstimate, Figure 6.3.

Let us turn our attention to the three approaches CS, RT1, and
RT2 that, like BF, consider conditions during the analysis but
also try to perform a trade-off between quality and execution
time. Figure 6.15 shows that the pessimism of the analysis is
dramatically reduced by considering the conditions during the
analysis. The RT1 and RT2 approaches, which visit each
unconditional subgraph, perform on average better than the CS approach,
which considers condition separation for the whole graph.
However, CS is comparable in quality with RT1, and even performs
better for graphs of size smaller than 240 processes (4
conditions, maximum 16 subgraphs).

The RT2 analysis, which, as opposed to RT1, tries to improve the
worst-case response times using the MaxSeparations, performs best
among the non-brute-force approaches. As can be seen from
Figure 6.15, RT2 has less than 20% average deviation from the
solutions obtained with the brute force approach. However, if
faster runtimes are needed, RT1 can be used instead, as it is
twice as fast as RT2.

We were also interested in comparing the approaches with
respect to the number of unconditional subgraphs and the
number of conditional process graphs in an application. For the
results depicted in Figure 6.16 we have assumed CPGs consisting
of 2, 4, 8, 16, and 32 unconditional subgraphs of maximum 50
processes each, allocated to 8 processors. Figure 6.16 shows that
as the number of subgraphs increases, the differences between
the approaches grow while the ranking among them remains the
same as in Figure 6.15. The CS approach performs
better than RT1 with a smaller number of subgraphs, but RT1
becomes better as the number of subgraphs in the CPGs
increases.

Figure 6.17 presents on a logarithmic scale the average
percentage deviations for systems consisting of 1, 2, 3, 4 and 5
conditional process graphs of 160 nodes each. As the number of
conditional process graphs increases, the IC and CS approaches
become more pessimistic. However, RT1 and RT2 perform very
well, with RT2 being the least pessimistic approach (except the
BF approach, which is not depicted in Figure 6.17).

6.8.2 Schedulability Analysis with TTP and
Bus Access Optimization

For the evaluation of our message scheduling approaches over
TTP we used applications generated for experimental purposes.
We considered architectures consisting of 2, 4, 6, 8 and 10 nodes.

Figure 6.16: Comparison of the Schedulability
Analysis Approaches for CPGs Based on the
Number of Unconditional Subgraphs
(axis label: Number of unconditional subgraphs)



Forty processes were assigned to each node, resulting in sets of 
80, 160, 240, 320 and 400 processes. Thirty applications were 
generated for each of the five dimensions. Thus, a total of 150 
applications were used for experimental evaluation. Worst-case 
computation times, periods, deadlines, and message lengths 
were assigned randomly within certain intervals. For the com- 
munication channel we considered a transmission speed of 256 
Kbps. The maximum length of the data field in a slot was 32 
bytes and the frequency of the TTP controller was chosen to be 20 
MHz. These experiments were also run on a SUN Ultra 10 work- 
station. 
Figure 6.17: Comparison of the Schedulability Analysis
Approaches Based on the Number of CPGs
(axis label: Number of conditional process graphs)

For each of the 150 generated examples and each of the four 
message scheduling approaches we have obtained the near-opti- 
mal values for the cost function (degree of schedulability, 
Section 6.6.1) as produced by our SA based algorithm (see 
Section 6.7.2). For a given example, these values might differ 
from one message passing approach to another, as they depend 
on the optimization parameters and the schedulability analysis 
which are particular for each approach. Figure 6.18 presents a 
comparison based on the average percentage deviation of the 
cost function obtained for each of the four approaches, from the 
minimal value among them. The percentage deviation is
calculated according to Equation 6.20.






The DP approach is, generally, able to achieve the highest
degree of schedulability, which in Figure 6.18 corresponds to the
smallest deviation. If the packet size is properly
selected, by scheduling messages dynamically we are able to
use the available space in the slots efficiently, and thus reduce
the release jitter. However, by using the MM approach we can
obtain almost the same result if the messages are carefully
allocated to slots, as our optimization strategy does.

Moreover, in the case of larger process sets, the static
approaches suffer significantly less overhead than the dynamic
approaches. In the SM and MM approaches the messages are
uniquely identified by their position in the MEDL. However, for
the dynamic approaches we have to identify the
dynamically transmitted messages and packets. Hence, for the
DM approach we consider that each message has several identifier
bits placed at the beginning of the message, while for the
DP approach the identification bits are added to each packet.
Not only do the identifier bits add to the overhead, but in the DP
approach, the transfer and delivery processes (see Figure 3.6 on
page 56) have to be activated for each sending and receiving of a
packet, and so interfere with the other processes.

Figure 6.18: Comparison of the Four Approaches to
Message Scheduling over TTP

Therefore, for larger applications (e.g., 400 processes), MM out- 
performs DP, as DP suffers from large overhead due to its 
dynamic nature. DM performs worse than DP because it does not 
split the messages into packets, and this results in a mismatch 
between the size of the messages dynamically queued and the 
slot size, leading to unused slot space that increases the jitter. 
SM performs the worst as it does not permit much room for 
improvement, leading to large amounts of unused slot space. 
Also, DP has produced a MEDL that resulted in schedulable appli- 
cations for 1.33 times more cases than the MM and DM. MM, in its 
turn, produced two times more schedulable results than the SM 
approach. 

Together with the four approaches to message scheduling, a
so-called ad-hoc approach is also presented. The ad-hoc approach
performs scheduling of messages without trying to optimize the
access to the communication channel. The ad-hoc solutions are
based on the MM approach and consider a design with the TDMA
configuration consisting of a simple, straightforward allocation
of messages to slots. The lengths of the slots were selected to
accommodate the largest message sent from the respective node.
Figure 6.18 shows that the ad-hoc alternative is consistently
outperformed by any of the optimized solutions. This demonstrates
that significant gains can be obtained by optimization of the
parameters defining the access to the communication channel.

Next, we have compared the four approaches with respect to
the number of messages exchanged between different nodes and
the maximum message size allowed. For the results depicted in
Figures 6.19 and 6.20 we have assumed applications of 80
processes allocated to 4 nodes. Figure 6.19 shows that, as the
number of messages increases, the difference between the
approaches grows while the ranking among them remains the
same. The same holds for the case when we increase the
maximum allowed message size (Figure 6.20), with a notable
exception: for large message sizes MM becomes better than DP, since
DP suffers from larger overhead due to its dynamic nature.

Figure 6.19: Four Approaches to Message Scheduling over
TTP: The Influence of the Messages Number

The above comparison between the four message scheduling
alternatives is mainly based on the issue of schedulability.
However, when choosing among the different policies, several
other parameters can be of importance. For example, a static
allocation of messages can be beneficial from the point of view of
testing and debugging and has the advantage of simplicity. Similar
considerations can lead to the decision not to split messages. In
any case, however, optimization of the bus access scheme is
highly desirable.

Figure 6.20: Four Approaches to Message Scheduling over
TTP: The Influence of the Message Sizes
(horizontal axis: maximum number of bytes in a message)

We were also interested in the quality of our greedy heuristics. 
Thus, we have run all the examples presented above, using the 
greedy heuristics and compared the results with those produced 
by the SA based algorithm. Table 6.2 shows the average and max- 
imum percentage deviations of the cost function values produced 
by the greedy heuristics from those generated with SA, for each of 
the application dimensions. All four greedy heuristics perform
very well, with an average loss in quality of less than 2% compared
to the results produced by the SA algorithms. The execution times
for the greedy heuristics were more than two orders of magnitude
smaller than those with SA.

Table 6.2: Percentage Deviations for the
Greedy Heuristics Compared to Simulated Annealing

Processes         80      160      240      320      400
SM    aver.     0.12%    0.19%    0.50%    1.06%    1.63%
      max.      0.81%    2.28%    8.31%   31.05%   18.00%
MM    aver.     0.05%    0.04%    0.08%    0.23%    0.36%
      max.      0.23%    0.55%    1.03%    8.15%    6.63%
DM    aver.     0.02%    0.03%    0.05%    0.06%    0.07%
      max.      0.05%    0.22%    0.81%    1.67%    1.01%
DP    aver.     0.01%    0.01%    0.05%    0.04%    0.03%
      max.      0.05%    0.13%    0.61%    1.42%    0.54%

6.8.3 The Vehicle Cruise Controller 

We have applied our approaches in sections 6.3 and 6.4 to the 
real-life example implementing a vehicle cruise controller 
described in Section 2.3.3: 

• The hardware architecture considered consists of five nodes 
interconnected by a TTP bus, and is presented in Figure 2.7a 
on page 37. 

• We have used the software architecture for event-driven sys- 
tems, outlined in Section 3.4. 

• We have applied the analyses in Section 6.3 to the mapped 
CPG, modeling the CC system presented in Figure 2.9 on 
page 40, but without considering the messages (depicted 
with solid circles in that figure). The deadline in this case 
has been set to 130 ms.

• We have also evaluated the scheduling approaches presented
in Section 6.4 using the CC model in Figure 2.9, considering
in this case a deadline of 500 ms.

For the approaches in Section 6.3, which aim at reducing the
pessimism of the analysis by using the conditions in the model,
we have obtained the following results. Without considering the
conditions, IC obtained a worst-case delay of 138 ms, so the
system was declared unschedulable. The system has also
been declared unschedulable by the Conditions Separation
(CS) approach, which has produced a result of 132 ms.

However, the Brute Force approach (BF) has produced a worst-case
delay of 124 ms, which proves that the system implementing
the vehicle cruise controller is, in fact, schedulable. Both
Relaxed Tightness alternatives (RT1 and RT2) have produced the
same worst-case delay of 124 ms as BF.

For the techniques in Section 6.4, where we have proposed 
four message scheduling approaches using the TTP, we have the 
following results concerning the cruise controller example. The 
ad-hoc solution and the SM approach failed to produce a schedu- 
lable solution (in both cases, 27 out of 32 processes had a 
response time larger than the deadline). 

On the other hand, with the other three approaches, schedulable
solutions were produced, DP generating the smallest cost
function, followed in this order by MM and DM. The deviation of
the MM approach from DP, calculated according to Equation 6.20,
was 2.44%, while for the DM approach the deviation was
11.97% from DP.

Based on these results, and on the individual properties of
each of the message scheduling approaches (see Section 6.4), the
designer can decide which approach to use for the cruise
controller implementation. However, the SM approach cannot be used
for the vehicle cruise controller because it leads to an
unschedulable system.

This chapter has constructed, step by step, a schedulability
analysis for applications with data and control dependencies
distributed on event-driven systems. The analysis will be used
in the next chapter to determine the schedulability of the
designs produced by a mapping and scheduling strategy that
considers an incremental design process, in a fashion similar to
our approach discussed in Chapter 5 for time-driven systems.

Event-Driven Systems



In Chapter 5 we have discussed an incremental design strat- 
egy addressed to systems where both processes and messages 
are statically scheduled. However, as mentioned before, consid- 
ering preemptive priority based scheduling for processes, with 
time triggered static scheduling for messages, can be the right 
solution under certain circumstances. 

Hence, in this chapter we concentrate on scheduling and map- 
ping of hard real-time systems which are implemented on dis- 
tributed architectures. Process scheduling is based on a static 
priority preemptive approach while the bus communication is 
performed using the TTP. The mapping and scheduling tasks are 
considered in the context of an incremental design process as 
outlined in Section 2.2. 

In Section 6.5 we have proposed four approaches for scheduling
of messages using TTP that differ in the way the messages
are allocated to the communication channel (either statically or
dynamically) and whether they are split or not into packets for
transmission. For each of these approaches, we have also developed
a corresponding schedulability analysis.

Comparing these four approaches, in Section 6.8.2 we conclude
that the Dynamic Packets Allocation (DP) approach generally
performs the best, since the dynamic scheduling of messages is
able to reduce release jitter. However, using the Multiple
Message Allocation (MM) approach we can obtain almost the
same result if the messages are carefully allocated to slots
off-line. Moreover, in the case of larger process sets MM
outperforms DP, as DP suffers from large overhead due to its dynamic
nature. Also, DM performs worse than DP and MM because it does
not split the messages into packets, and this results in a
mismatch between the size of the messages dynamically queued and
the slot size, leading to unused slot space that increases the
jitter. SM performs the worst as it does not permit much room for
improvement, leading to large amounts of unused slot space.

Therefore, for the purpose of this chapter, we consider that the
messages are scheduled using the MM approach, and for the
details of the corresponding schedulability analysis the reader is
referred to Section 6.5. The discussion can easily be extended to
any of the other three message passing approaches presented
before.

The chapter is divided into five sections. The next two sections 
present some aspects of the general mapping and scheduling 
problem for event-driven systems, and the issues related to con- 
sidering these design tasks within an incremental design pro- 
cess. Section 7.3 introduces the detailed problem formulation 
and the quality metrics we have defined. The mapping and 
scheduling strategy is presented in Section 7.4, and the 
approaches are evaluated in Section 7.5.

7.1 Application Mapping and Scheduling

In Section 5.1 we have discussed some of the problems related
to mapping and scheduling in the context of time-driven sys- 
tems. In the beginning of this section we are going to have a look 
at the same design tasks, but in the context of event-driven sys- 
tems, without considering, for the moment, an incremental 
design process. The particular issues related to mapping and 
scheduling for event-driven systems in the context of an incre- 
mental design approach will be presented later on in the discus- 
sion. 

Example 7.1: Let us consider the example in Figure 7.1,
where we have three system nodes N1, N2, N3 (N1 and N3
having the same speed) and the TTP bus. Our task is to map
P1, P2 and P3 so that all deadlines are met. Process P1 sends
message m1 to P3 and message m2 to P2.

In the configuration presented in Figure 7.1a P1 is mapped
on N3, P3 on N1 and P2 is mapped on node N2. Both m1 and
m2 have to be sent in slot S3 corresponding to node N3 where
the sender process P1 is mapped. With the bus configuration
such that m1 is scheduled in the first round while m2 is
scheduled in the second round, P2 misses its deadline (it has
to wait for message m2 sent by P1).

However, with the same message schedule, if we map P1 on
N1 and P3 on N3 as depicted in Figure 7.1b, m1 and m2 are
sent in slot S1 (corresponding to node N1), which comes before
S3, and P2 does not miss its deadline, receiving message m2
on time. This could have also been achieved by a different
scheduling of the messages, presented in Figure 7.1c, where
message m2 is scheduled in the first round, and m1 in the
second, with the same mapping as in Figure 7.1a.

Figure 7.1: Mapping and Scheduling Example
for Event-Driven Systems
a) Process P1 on N3, P3 on N1: P2 misses its deadline
b) Process P1 on N1, P3 on N3: P2, mapped on N2, meets its deadline
c) Same mapping as in case a, message m2 sent first: P2 meets its deadline
d) Application

However, as we have shown in Chapter 5 in the context of
time-driven systems, it is not enough to produce a mapping and 
scheduling so that the system is schedulable if we are to support 
an incremental design process as discussed in Section 2.2. 

Thus, we would like to perform mapping and scheduling such
that: the timing constraints are satisfied, the modifications to
the existing applications are minimized, and there is a good
chance that future applications can easily be added to the
resulting system.

Example 7.2: To illustrate the role of mapping and scheduling
in the context of an incremental design process, let us
consider the example in Figure 7.2, where we have two
processors with the same speed connected by a TTP bus. With
black we represent the set of already running applications ψ,
while the current application Γ_current to be mapped and
scheduled is represented in gray and consists of two processes and
three messages.

In order for a system to be schedulable, a necessary condition
is that the utilization factor of any node is less than one.
We say that the processor can be “filled up” with processes
until it reaches a utilization factor of one (the square
depicting the processor is full). The utilization factor of a
process P_i is the ratio between the worst-case execution time
C_i of that process and its period T_i: U_i = C_i / T_i. The
utilization factor of a node is the sum of the utilization factors
of all processes mapped on that node. The processes and messages
that are to be mapped on the processors are depicted as
blocks. The height of a process block is equal to its utilization
factor, while the length of a message block gives the size
of the message. White space on a processor represents available
utilization, while white space on the bus represents
available slack in the schedule table.

Now, let us suppose that, in the future, another application
Γ_future has to be mapped on the system. Γ_future consists of two
processes and two messages represented as hashed blocks.

We can observe that the new application can be scheduled
only in the second case, presented in Figure 7.2c. If Γ_current
has been implemented as in Figure 7.2b, we are not able to
schedule process P4 and one of the messages of Γ_future. The way
our current application is mapped and scheduled will influence
the likelihood of successfully mapping additional functionality
on the system without being forced to modify the
implementation of already running applications.
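The utilization-based necessary condition used in Example 7.2 can be sketched in a few lines. This is only an illustration under stated assumptions: each process is given as a pair (worst-case execution time, period) in the same time unit, and the node names and numbers are hypothetical. Passing this check does not prove schedulability; the response-time analysis of Chapter 6 is still needed.

```python
def process_utilization(wcet, period):
    """Utilization factor of one process: U_i = C_i / T_i."""
    return wcet / period


def node_utilization(processes):
    """Sum of the utilization factors of all processes mapped on a node."""
    return sum(process_utilization(c, t) for c, t in processes)


def passes_necessary_condition(mapping):
    """True if every node has a utilization factor below one.

    This is a necessary condition only, not a sufficient one.
    """
    return all(node_utilization(procs) < 1.0 for procs in mapping.values())


# Hypothetical mapping: node -> list of (WCET, period) pairs
mapping = {
    "N1": [(10, 100), (20, 50)],   # 0.1 + 0.4 = 0.5
    "N2": [(30, 60), (45, 90)],    # 0.5 + 0.5 = 1.0, the node is "full"
}
print(passes_necessary_condition(mapping))
```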



7.2 Mapping and Scheduling in an Incremental 
Design Approach 

We model an application Γ_current as a set of conditional process
graphs, as outlined in Section 2.3.1. Thus, for each process P_i we
know the set of potential nodes on which it could be mapped
and its worst-case execution time on each of these nodes. The
underlying architecture is as presented in Section 3.4. We
consider fixed priority preemptive scheduling for processes and a
time-triggered message passing policy, as imposed by the TTP
protocol.

Our goal is to map and schedule an application Γ_current on a
system that already implements a set ψ of applications,
considering the following requirements, outlined previously in
Chapter 5:

Requirement a  All constraints on Γ_current are satisfied and
               minimal modifications are performed to
               the applications in ψ.

Requirement b  New applications Γ_future can be mapped on
               top of the resulting system.

If it is not possible to map and schedule Γ_current without
modifying the already running applications, we have to change the
mapping and scheduling of some applications in ψ. However, even
with serious modifications performed on ψ, it is still possible that
certain constraints are not satisfied. In this case the hardware
architecture has to be changed by, for example, adding a new
processor. Here we will not discuss this last case, but will
concentrate on the situation where a possible mapping and scheduling
which satisfies requirement a can be found, and this solution has
to be further improved by considering requirement b.

In order to achieve our goals, we need certain information to
be available concerning the set of applications ψ as well as the
possible future applications Γ_future. In Section 2.3.2 we have
presented the type of information we consider available for the
existing applications in ψ, while in Section 5.2.1 we have shown
how we can capture the characteristics of future time-driven
applications. In the next section we outline the characterization
of future event-driven applications. Moreover, as in the case of
Chapter 5, we consider that Γ_current can interact with the
previously mapped applications ψ by reading messages generated on
the bus by processes in ψ.

7.2.1 Characterizing Future Applications 

In Section 5.2.1 we have argued that, given a certain limited 
application area (e.g., automotive electronics), it is possible to 
characterize the family of applications which in the future would 
be added to the current system. 

Thus, we consider that, concerning the future applications, we
know the set S_U = {U_min, ..., U_i, ..., U_max} of possible
processor utilization factors for processes, and the set
S_b = {b_min, ..., b_i, ..., b_max} of possible message sizes. The
processor utilization factor U_i provides a measure of the
computational load on a node due to a process P_i, and is
expressed as

    U_i = C_i / T_i                                   (7.1)

The utilization factors in S_U are considered relative to the
slowest node in the system. All the other nodes are characterized
by a speedup factor relative to this slowest node.

Thus, the utilization factor of an entire application is given by

    U = \sum_{i=1}^{n} U_i                            (7.2)

We also assume that we know the distributions of probability
f_SU(U) for U ∈ S_U and f_Sb(b) for b ∈ S_b.

Example 7.3: For example, we might have utilization factors
S_U = {0.02, 0.05, 0.1, 0.2} for the future application. If
almost half of the processes are assumed to have a utilization
factor of 0.1, and there is a lower probability of having
processes with utilization factors of 0.2 and 0.02, then our
distribution function f_SU(U) could look like this: f_SU(0.02) = 0.15,
f_SU(0.05) = 0.25, f_SU(0.1) = 0.45, f_SU(0.2) = 0.15.

■ 

Another piece of information is related to the period of future
applications. In particular, the smallest expected period T_min is
assumed to be given, together with the expected necessary bus
bandwidth b_need inside such a period T_min. As will be shown
later, this information is used in order to provide a fair
distribution of slacks on the bus.

7.3 Quality Metrics and Exact Problem 
Formulation 

Similarly to our approach to incremental mapping and schedul- 
ing for time-driven systems in Chapter 5, we develop two design 
criteria and associated metrics for event-driven systems, which 
are able to determine how well a system implementation sup- 
ports an incremental design process. 

We start by observing that a designer will be able to map and
schedule an application Γ_future on top of a system implementing
ψ and Γ_current if there are sufficient resources available. In
our case, the resources are the processor time and the bandwidth
on the bus. In the context where processes are scheduled
according to a fixed priority preemptive policy and messages are
scheduled statically, having free resources translates into
having enough processor capacity, and having space left for
messages in the bus slots. We measure the processor capacity using
the available utilization, while the available resources on the
bus are called slack.

It is to be noted that the total quantity of computation and
communication power available on our system after we have
mapped and scheduled Γ_current on top of ψ is the same regardless
of the mapping and scheduling policies used. What depends on
the mapping and scheduling strategy is the distribution of the
available utilization on each processor, the size of the individual
slacks on the bus, and the distribution of slacks along the time
line. It is the distribution of available utilization and the size
and distribution of the slacks that characterizes the quality of a
certain design alternative. In this section we introduce the
design criteria which reflect the degree to which one design
alternative meets the requirement b introduced previously. For
each criterion we provide metrics which quantify the degree to
which the criterion is met. Relative to processes we have
introduced one criterion which reflects how well the resulting
available utilization on the nodes fits the requirements of a future
application. For messages, there are two criteria. The first one
reflects how well the resulting slack sizes fit a future application,
and the second criterion expresses how well the slack is
distributed over time.

7.3.1 Processes Related Criterion

The distribution of available utilization on the nodes, resulting
after implementation of Γ_current on top of ψ, should be such that it
best accommodates a given family of applications Γ_future,
characterized by the set S_U and the probability distribution f_SU,
as outlined before.

Example 7.4: Let us consider the example in Figure 7.2,
where we have two processors and the applications ψ and
Γ_current already mapped. Suppose that application Γ_future
consists of the two processes, P3 and P4. If we schedule
Γ_current as in Figure 7.2b it is impossible to fit Γ_future because
there is not enough available utilization on any of the
processors that can accommodate process P4. A situation as
the one depicted in Figure 7.2c is desirable, where the
resulting available utilization is such that the future
application can be accommodated.

■ 

In order to measure the degree to which the available utilization
in a given design alternative fits the future applications, we
provide a metric C_1^P which indicates to what extent the largest
future application (considering the sum of available process
utilization) could be mapped on top of the current design. This
potentially largest application is determined knowing the total
size of the available utilization, and the characteristics of the
application: S_U and f_SU.

Example 7.5: For example, if our total available utilization
on all the processors is 1.81 then we have to distribute this
utilization according to the probabilities in f_SU. Considering
the numerical example for processes given in Example 7.3,
the largest application will be estimated to have a total of 20
processes: 3 processes of utilization 0.02, 5 of 0.05, 9
processes (almost half, f_SU(0.1) = 0.45) of utilization 0.1, and 3 of
0.2. If the number of processes for a particular dimension is
not an integer, then we use the ceiling.

■ 
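The numbers in Example 7.5 can be reproduced with the short sketch below. It rests on an assumption that is not spelled out in the text: the number of processes n is estimated as the total available utilization divided by the expected utilization of one process under f_SU, and each size then receives ceil(n · f_SU(U)) processes. The function name is hypothetical; exact fractions are used so that the ceiling is not disturbed by floating-point rounding.

```python
from fractions import Fraction
from math import ceil


def largest_future_application(total_available, distribution):
    """Estimate the largest future application.

    distribution maps each utilization factor U in S_U to f_SU(U).
    Returns {U: number of processes with utilization U}.
    """
    # Expected utilization of a single process under the distribution.
    expected = sum(u * p for u, p in distribution.items())
    # Estimated number of processes that fill the available utilization.
    n = total_available / expected
    # Non-integer counts are rounded up, as stated in Example 7.5.
    return {u: ceil(n * p) for u, p in distribution.items()}


# Values from Example 7.3 and Example 7.5
f_su = {
    Fraction("0.02"): Fraction("0.15"),
    Fraction("0.05"): Fraction("0.25"),
    Fraction("0.1"): Fraction("0.45"),
    Fraction("0.2"): Fraction("0.15"),
}
counts = largest_future_application(Fraction("1.81"), f_su)
for u, k in counts.items():
    print(float(u), k)
```

With these inputs the expected utilization per process is 0.0905, so n = 1.81 / 0.0905 = 20, which matches the 3, 5, 9 and 3 processes listed in the example.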

After we have determined the largest Γ_future we apply a
bin-packing algorithm [Mar90] using the best-fit policy in which we
consider processes as the objects to be packed, and the available
utilization as containers. The total utilization of unpacked
processes Uq relative to the total utilization of the process set Uf
gives the C_1^P metric: C_1^P = (Uq / Uf) · 100.

Example 7.6: In the case presented in Figure 7.2b U1 = 0.3
and U2 = 0.25, and P2 represents 45% of the largest possible
future application. In this case C_1^P = 45%. However, in
Figure 7.2c, where we were able to completely map the future
application, C_1^P = 0%.

■ 
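A minimal sketch of how the process-related metric could be computed with a best-fit packing is given below. Several choices are assumptions made for illustration only: processes are considered in decreasing order of utilization (the text only says best-fit), utilizations are expressed in integer percentage points to avoid rounding issues, and the sample numbers are hypothetical rather than those of Figure 7.2.

```python
def c1p_metric(process_utils, available):
    """Percentage of the future application's utilization left unpacked.

    process_utils: utilization of each process of the largest future
                   application (objects to pack).
    available:     available utilization on each node (containers).
    """
    free = list(available)
    unpacked = 0
    # Assumption: largest processes first (best-fit decreasing).
    for u in sorted(process_utils, reverse=True):
        candidates = [i for i, cap in enumerate(free) if cap >= u]
        if candidates:
            # Best fit: the node that leaves the least space after packing.
            best = min(candidates, key=lambda i: free[i] - u)
            free[best] -= u
        else:
            unpacked += u
    return 100.0 * unpacked / sum(process_utils)


# Hypothetical case: two nodes with 30 and 25 points of free utilization
print(c1p_metric([25, 20, 10, 5], [30, 25]))   # one process of 5 is left out
print(c1p_metric([10, 20], [30, 25]))          # everything fits
```

The same routine can serve for the message-related metric if message sizes are packed into bus slacks instead of processes into nodes.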

7.3.2 Criteria Related to Messages 

The first criterion for messages is similar to the one defined for
processes. Thus, the slack sizes in the message schedule table
MEDL (see Section 3.2.1), resulting after implementation of
Γ_current on top of ψ, should be such that they best accommodate a
given family of applications Γ_future, characterized by the set S_b
and the probability distribution f_Sb for messages.

Example 7.7: Let us consider the example in Figure 7.2,
where we have two processors and the applications ψ and
Γ_current already mapped. Application Γ_future has two
messages. It can be observed that the best configuration,
taking into consideration only slack sizes, is to have a
contiguous slack. However, in reality, it is almost impossible
to map and schedule the current application such that a
contiguous slack is obtained. Not only is it almost impossible,
but it is also undesirable from the point of view of the second
design criterion, discussed next. On the other hand, as we can
see from Figure 7.2b, if we schedule Γ_current such that it
fragments the slack too much, it is impossible to fit Γ_future
because there is no slack that can accommodate one of its
messages. A situation as the one depicted in Figure 7.2c is
desirable, where the resulting slack sizes can accommodate the
characteristics of the Γ_future application.

■ 

In order to measure the degree to which the slack sizes in a
given design alternative fit the future applications, we provide
the metric C_1^m. The metric indicates how much of the
communication of the largest future application, which theoretically
could be mapped on the system if the slacks on the bus were
summed, can be mapped on the current design alternative. The
messages accounting for the largest amount of communication
are determined, as shown above for processes, knowing the total
size of the available slack, and the characteristics of the
application: S_b and f_Sb.

C_1^m is calculated similarly to the metric C_1^P but, instead of
packing the processes as objects, we try to pack the messages
into the available slack on the bus. C_1^m is then the total size of
unpacked messages, relative to the total size of messages in the
largest future application.

Example 7.8: For Figure 7.2b, where one message could not be
scheduled, C_1^m is 75% because the unscheduled message of
6 bytes represents 75% of the total message sizes of 8 bytes.
For the design alternative in Figure 7.2c C_1^m is 0% because
all the messages have been scheduled.

■ 

We have just discussed a metric for how well the sizes of the 
slacks fit a possible future application. A similar metric is 
needed to characterize the distribution of slacks over time. 

During implementation of Γ_current we aim for a slack
distribution such that the future application with the smallest
expected period T_min and with the expected necessary bandwidth
b_need inside the period T_min can be accommodated. The minimum
over the slacks inside each period, which is available
periodically to the messages of Γ_future, is the C_2^m metric.

Example 7.9: In Figure 7.3 we present a message schedule
scenario. We consider a situation with T_min = 120 ms and
b_need = 40 ms. The length of the schedule table is 360 ms, and
the already scheduled messages of ψ and Γ_current are depicted
in black.

Let us consider the situation in Figure 7.3a. In the first
period, Period 1, there are 40 ms of slack available on
the bus, in the second period 80 ms, and in the third period
no slack is available. Hence, the total slack a future
application with a period T_min can use on the bus in each period is
C_2^m = min(40, 80, 0) = 0 ms. In this case, the messages cannot
be scheduled. However, if we move message m1 to the left in the
schedule table, we are able to create, in Figure 7.3b, 40 ms of
slack in each period, resulting in C_2^m = 40 ms = b_need.



[Figure 7.3: bus schedule table of 360 ms divided into Period 1,
Period 2 and Period 3, with T_min = 120 and b_need = 40.
a) C_2^m = min(40, 80, 0) = 0 ms
b) C_2^m = min(40, 40, 40) = 40 ms
Legend: ψ and Γ_current (scheduled messages), slack, periodic slack]
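The periodic slack of Example 7.9 can be computed as sketched below. The representation is an assumption: the free space on the bus is given as a list of (start, end) intervals in milliseconds, and the positions of the intervals inside each period are hypothetical, chosen only so that the per-period totals match the 40, 80 and 0 ms of the example.

```python
def periodic_slack(slacks, t_min, table_length):
    """Return (C_2^m, slack per period) for a schedule table.

    slacks:       free intervals (start, end) on the bus, in ms.
    t_min:        smallest expected period of the future application.
    table_length: length of the schedule table, a multiple of t_min.
    """
    n_periods = table_length // t_min
    per_period = [0] * n_periods
    for start, end in slacks:
        for k in range(n_periods):
            lo, hi = k * t_min, (k + 1) * t_min
            # Part of this slack interval that falls inside period k.
            overlap = min(end, hi) - max(start, lo)
            if overlap > 0:
                per_period[k] += overlap
    return min(per_period), per_period


# Case a): 40 ms, 80 ms and 0 ms of slack in the three periods
print(periodic_slack([(80, 120), (160, 240)], 120, 360))
# Case b): after moving one message, 40 ms of slack in every period
print(periodic_slack([(80, 120), (200, 240), (320, 360)], 120, 360))
```

The design alternative is acceptable with respect to the second message criterion when the first returned value is at least b_need.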
7.3.3 Cost Function and Exact Problem Formulation

In order to capture how well a certain design alternative meets
the requirement b stated previously, in Chapter 5 we have
combined the design metrics in a cost function C. In the case of
event-driven systems and the metrics presented in this chapter,
the cost function is constructed similarly, as:

    C = w_1^P (C_1^P)^2 + w_1^m (C_1^m)^2 + w_2^m \max(0, b_{need} - C_2^m)    (7.3)

where the metric values are weighted by the constants w_i.

Our mapping and scheduling strategy will try to minimize
this function. A design alternative that does not meet the second
design criterion for messages is not considered a valid solution.
Thus, using the last term, we strongly penalize the cost function
if b_need is not satisfied, by using high values for the weight.
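As an illustration, the cost function can be evaluated as in the sketch below. It assumes that Equation 7.3 has two squared terms, for the process-related metric C_1^P and the first message-related metric C_1^m, plus a penalty term max(0, b_need − C_2^m). The weight values are hypothetical; the text only states that the weight of the last term is chosen high.

```python
def cost(c1p, c1m, c2m, b_need, w1p=1.0, w1m=1.0, w2m=1000.0):
    """Cost of a design alternative under the assumed form of Equation 7.3.

    c1p, c1m: percentage metrics for processes and messages (0..100).
    c2m:      periodic slack on the bus.
    b_need:   bandwidth needed by the future application per period.
    """
    penalty = max(0.0, b_need - c2m)   # zero once the periodic slack suffices
    return w1p * c1p ** 2 + w1m * c1m ** 2 + w2m * penalty


# Metric values quoted for Figure 7.2b and Figure 7.3a, combined here
# purely as an illustration
print(cost(45, 75, 0, 40))
# A design that fits the future application and offers enough periodic slack
print(cost(0, 0, 40, 40))
```

Because the penalty term dominates for any reasonable choice of a high weight, alternatives that violate the second message criterion are driven out of the search before the other two terms are traded against each other.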

At this point, we can give an exact formulation to our problem,
which is synonymous with the problem addressed in Chapter 5
in the context of time-driven systems: Given an existing set of
applications ψ which are already mapped and scheduled, and an
application Γ_current to be mapped on top of ψ, we are interested
in finding a mapping and scheduling of Γ_current which satisfies all
deadlines such that the existing applications are disturbed as
little as possible. In our context, this means finding the subset
Ω ⊆ ψ of old applications to be remapped and rescheduled such
that we produce a valid solution for Γ_current ∪ Ω and the total
cost of modification R(Ω), as introduced in Section 2.3.2, is
minimized. At the same time, the solution should minimize the cost
function C, considering a family of future applications
characterized by the sets S_U and S_b, the functions f_SU and f_Sb,
as well as the parameters T_min and b_need.

7.4 Mapping and Scheduling Strategy

The mapping and scheduling strategy proposed in this section is
similar to the one outlined in Section 5.4.2 for time-driven
systems. The differences lie in the formulation of the design
criteria and metrics, in how these are used to select potential
moves, and in the definition of the subset selection heuristics,
which are tuned for event-driven systems.

As shown in Figure 7.4, our mapping and scheduling strategy
(MH) has two main steps. In the first step we try to obtain a valid
solution for Γ_current ∪ Ω so that the total modification cost R(Ω)
is minimized (Ω ⊆ ψ is the subset of existing applications that
have to be modified to accommodate Γ_current). Starting from such a
solution, a second step iteratively improves on the design in
order to minimize the cost function C (Equation 7.3).

We iteratively improve the design using a transformational
approach. A new design is obtained from the current one by
performing a transformation called a move. We consider the following
moves: moving a process to a different node, and moving a
message to a different slack on the bus. We only perform valid
moves, which result in a schedulable system. The intelligence of
the Mapping Heuristic lies in how the potential moves are
selected. For each iteration a set of potential moves is generated
by the PotentialMoveX functions. The SelectMoveX functions then
evaluate these moves with regard to the respective metrics and
select the best one to be performed.
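The transformational loop can be sketched generically as follows. This is not the MH algorithm itself: the move generator, the validity check and the cost are passed in as functions, and the toy instance below (three processes, two nodes, a load-imbalance cost) is entirely hypothetical and stands in for the metrics of Section 7.3.

```python
def improve(design, potential_moves, apply_move, is_schedulable, cost, limit=100):
    """Repeatedly perform the valid move that lowers the cost the most."""
    current_cost = cost(design)
    for _ in range(limit):
        best, best_cost = None, current_cost
        for move in potential_moves(design):
            candidate = apply_move(design, move)
            if not is_schedulable(candidate):
                continue                      # only valid moves are considered
            c = cost(candidate)
            if c < best_cost:
                best, best_cost = candidate, c
        if best is None:                      # no move improves the cost
            break
        design, current_cost = best, best_cost
    return design


# --- Toy instance (hypothetical): utilizations in percentage points ---
UTIL = {"P1": 40, "P2": 30, "P3": 20}
NODES = ("N1", "N2")

def load(design, node):
    return sum(UTIL[p] for p, n in design.items() if n == node)

def potential_moves(design):
    # Move one process to a different node.
    return [(p, n) for p in design for n in NODES if design[p] != n]

def apply_move(design, move):
    process, node = move
    new_design = dict(design)
    new_design[process] = node
    return new_design

def is_schedulable(design):
    # Stand-in for the schedulability analysis: utilization below 100%.
    return all(load(design, n) < 100 for n in NODES)

def imbalance_cost(design):
    # Stand-in for the cost function C.
    return abs(load(design, "N1") - load(design, "N2"))

start = {"P1": "N1", "P2": "N1", "P3": "N1"}
result = improve(start, potential_moves, apply_move, is_schedulable, imbalance_cost)
print(result, imbalance_cost(result))
```

In the actual strategy the set of candidate moves is restricted to those with the highest potential to improve the metrics, which is what keeps the number of schedulability analyses per iteration manageable.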

7.4.1 The Initial Mapping and Scheduling 

The first step of MH consists of an iteration that tries subsets
Ω ⊆ ψ with the intention to find that subset Ω = Ω_min which
produces a valid solution for Γ_current ∪ Ω such that R(Ω) is
minimized (lines 3-23 in Figure 7.4).

Given a subset Ω, the InitialMappingScheduling function (IMS)
constructs a mapping and schedule for Γ_current ∪ Ω that meets the
deadlines (both for processes in Γ_current and those in Ω), without
worrying about the design criteria in Section 7.3. For IMS we
used as a starting point the mapping algorithm introduced in
[Tin92], based on a simulated annealing strategy. We have
modified the mapping algorithm in [Tin92] to consider during
mapping a set of previous applications that have already been
mapped, and to schedule the messages according to the TDMA
protocol, using the MM approach (Section 6.5.2). The
schedulability test that checks a particular mapping alternative is
performed according to our schedulability analysis presented in
Section 6.5.

MappingSchedulingStrategy (MH)
 1  Ω = ∅
 2
 3  -- Step 1: try to find a valid schedule for Γ_current that minimizes R(Ω)
 4  repeat
 5    succeeded = IMS(ψ, Γ_current ∪ Ω)  -- initial mapping and scheduling
 6    ASAP(Γ_current ∪ Ω); ALAP(Γ_current ∪ Ω)
 7      -- compute worst case ASAP-ALAP intervals for messages
 8    if succeeded then
 9      -- try to satisfy the second message related design criterion
10      repeat
11        -- find moves with highest potential to maximize C_2^m
12        move_set = PotentialMoveC2m(Γ_current ∪ Ω)
13        -- select and perform move which improves C_2^m most
14        move = SelectMoveC2m(move_set)
15        Perform(move)
16        succeeded = C_2^m ≥ b_need
17      until succeeded or limit reached
18    end if
19    if succeeded and R(Ω) smallest so far then
20      Ω_valid = Ω; solution_valid = solution_current
21    end if
22    Ω = NextSubset(Ω)  -- try another subset
23  until termination condition
24
25  if not succeeded then modify architecture; go to step 1; end if
26
27
28  -- Step 2: try to improve the cost function C
29  solution_current = solution_valid; Ω_min = Ω_valid
30  repeat  -- find moves with highest potential to minimize C
31    move_set = PotentialMoveC1P(Γ_current ∪ Ω_min)
32               ∪ PotentialMoveC1m(Γ_current ∪ Ω_min)
33    -- select move which improves C and does not invalidate
34    -- the second message related design criterion
35    move = SelectMoveC1(move_set)
36    Perform(move)
37  until C has not changed or limit reached
38
end MappingSchedulingStrategy

Figure 7.4: The Mapping and Scheduling Strategy
to Support Iterative Design

If IMS succeeds in finding a mapping and a schedule which
meet the deadlines, this is not yet a valid solution. In order to
produce a valid solution we iteratively try to satisfy the second
design criterion for messages (lines 10-17 in Figure 7.4). In
terms of our metrics, that means a mapping and scheduling such
that C_2^m ≥ b_need. Potential moves can be the shifting of
messages inside their worst case (largest) [ASAP, ALAP] interval
in order to improve the periodic slack. In PotentialMoveC_2^m,
line 12, we also consider movement of processes, trying to place
the sender and receiver of a message on the same processor, thus
reducing the bus load. SelectMoveC_2^m, line 14, evaluates these
moves with regard to the second design criterion and selects the
best one to be performed.
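
The effect of a message move on the periodic slack can be
illustrated with a small sketch. It assumes that the periodic
slack of a resource is the minimum of the slack available in each
period, which is how Example 7.10 below evaluates it. The function
names and the "before" slack values are hypothetical and chosen
only to be consistent with that example.

```python
def periodic_slack(slack_per_period):
    """Slack guaranteed in every period: the minimum over all periods."""
    if not slack_per_period:
        raise ValueError("at least one period is required")
    return min(slack_per_period)


def move_message(slack_per_period, from_period, to_period, length):
    """Return the slack per period after moving a message of `length` ms
    out of `from_period` and into `to_period` (0-based indices)."""
    if slack_per_period[to_period] < length:
        raise ValueError("not enough slack in the target period")
    updated = list(slack_per_period)
    updated[from_period] += length  # the message no longer occupies this period
    updated[to_period] -= length    # it now consumes slack in the target period
    return updated


before = [40, 80, 0]  # assumed slack (ms) in Periods 1..3 before the move
after = move_message(before, from_period=2, to_period=1, length=40)
print(periodic_slack(before), periodic_slack(after))  # 0 40
```

A move is attractive for the second design criterion when it
raises this minimum, even though the total slack is unchanged.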

Example 7.10: Consider Figure 7.3a. In Period 3 on node
N_1 there is no available slack. However, if we move the message
with 40 ms to the left into Period 2, as depicted in
Figure 7.3b, we create a slack in Period 3, thus the periodic
slack on the bus will be min(40, 40, 40) = 40, instead of 0.

7.4.2 Incremental Mapping and Scheduling Strategy

If Step 1 of the MH algorithm (Figure 7.4) has succeeded, a
mapping and scheduling of Γ_current ∪ Ω has been produced which
corresponds to a valid solution. In addition, Ω has the smallest
modification cost (minimization of the modification cost is
introduced in Section 2.3.2 and detailed in Section 7.4.3).
Starting from this valid solution, the second step of the MH
strategy (lines 30-37) tries to improve on the design in order to
minimize the cost function C. In a similar way as during Step 1,
we iteratively improve the design by successive moves, without
invalidating the second criterion achieved in the first loop.

The loop ends when there is no improvement to be achieved on
the first two terms of the cost function, or a limit imposed on the
number of iterations has been reached (line 37). For each
iteration, the algorithm performs moves that have the highest
chance to improve the cost function. The moves are generated in
the PotentialMoveC_1 functions (lines 31-32), and are evaluated and
selected based on the respective metrics in the SelectMoveC_1
function (line 35). We now briefly discuss the PotentialMoveC_1^P
and PotentialMoveC_1^m functions (PotentialMoveC_2^m has been
discussed in the previous section).

PotentialMoveC_1^P

Let U_f be the total utilization factor of the largest future
application Γ_fmax and U_0 the utilization of that part which
cannot be mapped in the current design alternative. This function
is responsible for selecting moves of processes from one node to
another so that C_1^P = (U_0 / U_f) · 100 is reduced.

Moving a process P_i, with the utilization factor U_i, from a node
N_j, where it is currently mapped, to a node N_k will increase the
available utilization on node N_j to U_Nj + U_i, and decrease the
available utilization on N_k to U_Nk - U_i. To find out U_0 in
this new case would mean executing the bin-packing with the
processes of the future application as objects and the new
available utilization configuration as containers. This can take
significant execution time since it has to be done for each
potential move.

In Section 7.3 we have explained how we can estimate the
processes that make up the largest future application Γ_fmax based
on the total available utilization and the characterization of
future applications. Let us assume that Γ_fmax consists of the set
P_fmax = {P_f1, P_f2, ..., P_fn} of processes, and that
P_0 = {P_fi, ..., P_fn} are the ones that cannot be mapped in the
current design alternative. The total utilization requested by the
unmapped processes is U_0 = U_fi + U_f(i+1) + ... + U_fn. For the
potential move of P_i from N_j to N_k we have to recalculate
C_1^P, which means determining U_0.
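
A minimal sketch of how U_0 and C_1^P could be recomputed for one
potential move is given below. It assumes a best-fit decreasing
bin-packing policy and expresses utilizations as integer
percentages; the policy, the function names and the sample values
are illustrative assumptions, not the exact procedure of
Section 7.3.

```python
def unmapped_after_packing(future_utils, available):
    """Best-fit decreasing packing of future-process utilizations into the
    available utilization of each node; returns what could not be mapped."""
    free = list(available)  # remaining utilization per node
    unmapped = []
    for u in sorted(future_utils, reverse=True):
        fits = [n for n, cap in enumerate(free) if cap >= u]
        if not fits:
            unmapped.append(u)
            continue
        best = min(fits, key=lambda n: free[n] - u)  # tightest fit
        free[best] -= u
    return unmapped


def c1p(future_utils, available):
    """C_1^P = (U_0 / U_f) * 100 for one design alternative."""
    u_f = sum(future_utils)
    u_0 = sum(unmapped_after_packing(future_utils, available))
    return 100.0 * u_0 / u_f


def apply_move(available, src, dst, u_i):
    """Available utilization after moving a process with utilization u_i
    from node `src` to node `dst`."""
    updated = list(available)
    updated[src] += u_i  # the source node gains free utilization
    updated[dst] -= u_i  # the destination node loses it
    return updated


future = [20, 20, 10, 10]  # hypothetical processes of the largest future application
before = [25, 15]          # hypothetical available utilization on two nodes
after = apply_move(before, src=1, dst=0, u_i=5)
print(c1p(future, before), round(c1p(future, after), 1))
```

In this hypothetical case the move evens out the free utilization
and lets a second 20% process fit, so C_1^P decreases. Running the
full packing for every candidate move is what the heuristic
described next tries to avoid.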

In order to reduce the execution time needed by the bin-packing
algorithm, we do not consider all the processes of Γ_fmax as
objects to be packed. We consider for repacking only those
processes belonging to Γ_fmax that had to be removed from N_k to
make room for P_i, together with those that were already left
outside. Our heuristic considers that to make room for P_i on
node N_k we remove those processes P_Nk ⊂ Γ_fmax mapped on N_k
which have the smallest utilization factor, since they are the
ones that should be easiest to fit on other nodes. The metric used
by SelectMoveC_1^P to rank this move is the sum of the utilization
factors of processes which are left out after trying to repack the
P_0 ∪ P_Nk set.

Out of the best moves according to the previous metric, we
encourage those that have the smallest impact on the
schedulability analysis, since we would like to keep the system
schedulable. This means moving processes that have low priority
(do not have a large impact on other processes) and have a
response time that is considerably smaller than their deadline
(D_i - R_i is large).

PotentialMoveC_1^m

In order to avoid excessive fragmentation of the slack on the bus
we will consider moving a message to a position that “snaps” to
another existing message. A message is selected for potential
move if it has the smallest “snapping distance,” i.e., in order to
attach it to another message it has to travel the smallest distance
inside the schedule table. We also consider moves that try to
increase the individual slack sizes. Therefore, we first eliminate
slack that is unusable: it is too small to hold the smallest
message of the future application. Then, the slacks are sorted in
ascending order and the smallest one is considered for
improvement. Such improvement of a slack is performed by moving a
nearby message, while avoiding the creation of an even smaller
individual slack as a result.
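
The two selection rules above can be sketched as follows. The
sketch assumes that messages on the bus are given as
non-overlapping (start, length) pairs and that slacks are plain
durations; the representation and the function names are
hypothetical.

```python
def snapping_distance(index, messages):
    """Smallest gap between messages[index] and any other message."""
    start, length = messages[index]
    end = start + length
    gaps = []
    for j, (s, l) in enumerate(messages):
        if j == index:
            continue
        # gap to a later message, or to an earlier one
        gaps.append(s - end if s >= end else start - (s + l))
    return min(gaps)


def message_to_snap(messages):
    """Index of the message with the smallest non-zero snapping distance."""
    if len(messages) < 2:
        return None
    distances = {i: snapping_distance(i, messages) for i in range(len(messages))}
    movable = {i: d for i, d in distances.items() if d > 0}
    return min(movable, key=movable.get) if movable else None


def slack_to_improve(slacks, smallest_future_message):
    """Drop unusable slacks, sort ascending, and return the smallest one."""
    usable = sorted(s for s in slacks if s >= smallest_future_message)
    return usable[0] if usable else None


messages = [(0, 10), (25, 5), (32, 8)]  # hypothetical bus schedule
print(message_to_snap(messages))
print(slack_to_improve([3, 12, 8, 20], smallest_future_message=8))
```

Messages that already touch a neighbor have distance zero and are
skipped, since moving them would not reduce fragmentation.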

7.4.3 Minimizing the Modification Cost 

In the first step of our mapping strategy, described in Figure 7.4,
we iterate on subsets Ω to search for a valid solution which also
minimizes the total modification cost R(Ω). As a first attempt,
the algorithm searches for a valid implementation of Γ_current
without disturbing the existing applications (Ω = ∅). If no valid
solution is found, successive subsets Ω produced by the function
NextSubset are considered, until a terminating condition is met.

In Section 5.4.3 of Chapter 5 we have presented several
approaches to the implementation of the NextSubset function in
the context of time-driven systems. The same strategies are
used in this chapter, but now in the case of event-driven
systems. The difference lies in the formulation of the Δ metrics
which guide the subset selection process.

The first approach to the implementation of the NextSubset
function is an exhaustive search algorithm (ES), similar to the
one presented in Section 5.4.3. As shown in that chapter, the
exhaustive approach that finds an optimal solution can be used
only for small sets of applications. The second approach
presented in Section 5.4.3 is a greedy heuristic, here named
Ad-hoc Subset Selection (AS), which very quickly finds a valid
solution, if one exists, with the drawback that the corresponding
total modification cost is higher than the optimal one. However,
as we argue in Chapter 5, an intelligent heuristic should be able
to identify the reasons why a valid solution has not been found
and use this information when selecting applications to be
included in Ω. The next section presents such a heuristic for the
case of event-driven systems.

Subset Selection Heuristic (SH) 

There can be two possible causes for not finding a valid solution: 
an initial mapping which meets the deadlines has not been pro- 
duced, or the second criterion is not satisfied. 

Let us investigate the first reason. If an application Γ_i is
schedulable, this means that all its processes meet their
deadlines. If IMS determines that the application is not
schedulable, this means that at least one of the processes P_i
missed its deadline: R_i > D_i. Besides the intrinsic properties of
the application that can lead to this situation, process P_i can
miss its deadline also because of the interference of higher
priority processes that are mapped on the same node as P_i,
processes that can also belong to other applications. In this
situation we say that there is a conflict with processes belonging
to other applications. We are interested in finding out which
applications are responsible for conflicts encountered by our
Γ_current, and not only that, but also which ones are flexible
enough to move away in order to avoid these conflicts
(D_j - R_j is large).

IMS determines a metric Δ_j that characterizes the degree of
conflict and the flexibility of application Γ_j in relation to
Γ_current. A set of applications Ω will be characterized, in
relation to Γ_current, by:

$$\Delta(\Omega) = \sum_{\Gamma_j \in \Omega} \Delta_j \qquad (7.4)$$

The metric Δ(Ω) will be used by our subset selection heuristic
if IMS has failed to produce a solution which satisfies the
deadlines. An application with a larger Δ_j is more likely to lead
to a valid schedule if included in Ω.

Basically, Δ_j is the total amount of interference caused by
higher priority processes of Γ_j to processes in Γ_current. For a
process P_i, the interference I_ji from a higher priority process
P_j mapped on the same node is the time that P_j delays the
execution of P_i, and is given by:

$$I_{ji} = \left\lceil \frac{J_j + R_i}{T_j} \right\rceil C_j \qquad (7.5)$$

where J_j is the release jitter of process P_j; a detailed
description of how it is calculated in the context of the MM
approach for message scheduling over TTP is given in
Section 6.5.2. Figure 7.5 presents in more detail how Δ_j is
calculated.
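
Equation 7.5 and its accumulation into the Δ metrics can be
sketched in a few lines. The `Process` record, its field names,
the convention that a larger number means a higher priority, and
the sample values are all assumptions made for illustration; C
and T are taken to be the worst-case execution time and the
period.

```python
import math
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    app: str         # application the process belongs to
    node: str        # node the process is mapped on
    priority: int    # assumption: larger value means higher priority
    C: float         # worst-case execution time
    T: float         # period
    J: float = 0.0   # release jitter
    R: float = 0.0   # worst-case response time
    D: float = 0.0   # deadline


def interference(p_j, p_i):
    """Equation 7.5: time that the higher priority p_j delays p_i."""
    return math.ceil((p_j.J + p_i.R) / p_j.T) * p_j.C


def delta_metrics(current, existing, frozen=()):
    """Per non-frozen existing application, sum the interference its higher
    priority processes cause to deadline-missing processes of `current`."""
    delta = {p.app: 0.0 for p in existing if p.app not in frozen}
    for p_i in current:
        if p_i.R <= p_i.D:
            continue  # only processes that miss their deadline count
        for p_j in existing:
            if (p_j.app in delta and p_j.node == p_i.node
                    and p_j.priority > p_i.priority):
                delta[p_j.app] += interference(p_j, p_i)
    return delta


current = [Process("P1", "cur", "N1", priority=1, C=5, T=50, R=30, D=20)]
existing = [
    Process("Pa", "A", "N1", priority=3, C=4, T=10, J=2),
    Process("Pb", "B", "N1", priority=2, C=5, T=20),
    Process("Pc", "B", "N2", priority=5, C=9, T=10),  # other node: ignored
]
print(delta_metrics(current, existing))
```

Summing the returned values over the applications of a subset Ω
gives Δ(Ω) of Equation 7.4.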

If the initial mapping was successful, the first step of MH could
fail during the attempt to satisfy the second design criterion for
messages. In this case, the metric Δ_j is computed in a different
way. It will capture the potential of an application Γ_j to improve
the metric C_2^m if remapped together with Γ_current. Thus, for the
improvement of C_2^m we consider a total number of moves from all
the non-frozen applications (determined using
PotentialMoveC_2^m, see Section 7.4.2). For each move that has as
subject m_i ∈ Γ_j, we increment the metric Δ_j with the predicted
improvement on C_2^m.

DeltaMetrics(Γ_current, Ω)
  for each non-frozen Γ_k ∈ Ω do
    Δ_k = 0
  end for

  for each P_i ∈ Γ_current do
    if R_i > D_i then
      for each non-frozen Γ_k ∈ Ω do
        -- hp(P_i) is the set of processes with higher priority than P_i
        for each P_j ∈ Γ_k ∩ hp(P_i) do
          Δ_k = Δ_k + C_j * ⌈(J_j + R_i) / T_j⌉
        end for
      end for
    end if
  end for

  return Δ
end DeltaMetrics

Figure 7.5: Determining the Delta Metrics

MH starts by trying an implementation of Γ_current with Ω = ∅. If
this attempt fails, because of one of the two reasons mentioned
above, the corresponding metrics Δ_j are computed for all Γ_j ∈ ψ.
Our heuristic SH will then start by finding the ad-hoc solution
Ω_AS produced by the AS algorithm (this will succeed if there
exists any solution) with a corresponding cost R_AS = R(Ω_AS) and
a Δ_AS = Δ(Ω_AS). SH now continues by trying to find a solution
with a more favorable Ω (a smaller total cost R). Therefore, the
thresholds R_max = R_AS and Δ_min = Δ_AS / n (for our experiments
we considered n = 2) are set. For generating new subsets Ω, the
function NextSubset now follows an approach similar to ES but in
the reverse direction, towards smaller subsets, and it will
consider only subsets with a smaller total cost than R_max and a
larger Δ than Δ_min (a small Δ means a reduced potential to
eliminate the cause of the initial failure). Each time a valid
solution is found, the current values of R_max and Δ_min are
updated in order to further restrict the search space. The
heuristic stops when no subset can be found with Δ > Δ_min, or a
certain imposed limit has been reached (e.g., on the total number
of attempts to find new sets).
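
The threshold-guided search of SH can be sketched as below. The
callback `is_valid` stands in for a run of Step 1 of MH on a
candidate subset, the enumeration ignores dependencies between
applications, and the rule used to update the thresholds after a
success (R_max = R(Ω), Δ_min = Δ(Ω) / n) is an assumption; the text
only states that both thresholds are updated.

```python
from itertools import combinations


def subset_selection(apps, cost, delta, omega_as, is_valid, n=2, limit=1000):
    """apps: application names; cost, delta: dicts name -> R_j, Delta_j;
    omega_as: subset found by the ad-hoc approach AS."""

    def total_cost(omega):
        return sum(cost[a] for a in omega)

    def total_delta(omega):
        return sum(delta[a] for a in omega)

    best = frozenset(omega_as)
    r_max, d_min = total_cost(best), total_delta(best) / n
    attempts = 0
    for size in range(len(best) - 1, 0, -1):  # towards smaller subsets
        for combo in combinations(sorted(apps), size):
            omega = frozenset(combo)
            if total_cost(omega) >= r_max or total_delta(omega) <= d_min:
                continue  # outside the current thresholds
            if attempts >= limit:
                return best
            attempts += 1
            if is_valid(omega):
                best = omega
                r_max, d_min = total_cost(omega), total_delta(omega) / n
    return best


# Hypothetical data: only subsets containing application "A" are valid.
cost = {"A": 50, "B": 20, "C": 70, "D": 10}
delta = {"A": 40, "B": 5, "C": 30, "D": 0}
result = subset_selection(cost.keys(), cost, delta, {"A", "B", "C"},
                          is_valid=lambda omega: "A" in omega)
print(sorted(result))
```

With these numbers the thresholds prune most candidates before
any costly validity check is made, which is the purpose of the Δ
metric.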

7.5 Experimental Evaluation

For the evaluation of our mapping strategies we first used
applications containing 40, 80, 160, 240 and 320 processes,
representing the application Γ_current generated for experimental
purposes.

Thirty applications were generated for each dimension, thus a
total of 150 applications were used for experimental evaluation.
We considered an architecture consisting of 10 nodes of different
speeds. For the communication channel we considered a
transmission speed of 256 Kbps and a length below 20 meters. The
maximum length of the data field in a bus slot was 8 bytes. All
experiments were run on a SUN Ultra 10.

7.5.1 Modification Cost Minimization Heuristics 

The first result concerns the quality of the designs obtained with 
our mapping strategy MH using the search heuristic SH com- 
pared to the case when the ad-hoc approach AS and the exhaus- 
tive search ES are used for subset selection. 

For each of the five application dimensions generated we have
considered a set of existing applications ψ consisting of 160, 240,
320, 400 and 480 processes, respectively. The sets contained 4,
6, 8, 10 and 12 applications, each application with an associated
modification cost assigned manually in the range 10 to 100. The
dependencies between applications were such that the total
number of subsets resulting for each set ψ was 8, 32, 128, 256,
and 1024. We have considered that the future applications Γ_future
consist of a process set of 80 processes, randomly generated
according to the following specifications: Sjj= {0.02, 0.05, 0.1, 
0.15, 0.2}, fsjJSjj) = (0.1, 0.25, 0.45, 0.15, 0.05), = (2, 4, 6, 8 

bytes}, = {0.2, 0.5, 0.2, 0.1}, = 250 ms, and = 20 

ms. 

MH has been used to produce a valid solution for each of the
150 process sets representing Γ_current, on top of the existing
applications ψ, using the ES, AS and SH approaches to subset
selection. For each of the resulting valid solutions, there is a
corresponding minimum modification cost. Figure 7.6a compares the
three approaches to subset selection based on the modification
cost needed in order to obtain a valid solution. The exhaustive
approach ES is able to obtain valid solutions at the optimum
(smallest) modification cost (e.g., less than 400, on average, for
systems with 12 applications consisting of a total of 480
processes), while the ad-hoc approach AS needs on average 3.11
times more costly modifications in order to obtain valid solutions
(e.g., more than 1100 for 480 processes in Figure 7.6a). However,
in order to find the optimal re-mapping the ES approach needs
large computation times. For example, it can take more than 35
minutes, on average, to find the smallest cost subset to be
remapped that leads to a valid solution in the case of 12
applications (corresponding to 480 processes in Figure 7.6b).
From Figure 7.6 we can see that the proposed heuristic SH
performs quite well, needing only 1.84 times larger costs, on
average, in order to obtain a valid schedule, at a computation
cost comparable with the fast ad-hoc approach AS (see Figure 7.6b).
For the results in Figure 7.6 we have eliminated those situations
in which a valid solution has not been produced by MH (which
means that there is no solution regardless of the modification
cost).

7.5.2 Incremental Mapping and Scheduling Heuristics 

Next, we were interested in investigating the quality of the
mapping heuristic MH compared to a so-called ad-hoc mapping
approach (AM).

To concentrate on this, we have considered that no modifications
are allowed to the applications in ψ. The AM approach is a
simple, straightforward solution to produce designs which, to a
certain degree, support an incremental process. AM tries to
evenly balance the available utilization remaining after mapping
the current application. The quality of the designs obtained
with MH and AM was compared with a near-optimal mapping
and schedule obtained with a Simulated Annealing (SA) strategy
(Appendix A) that minimizes the cost function C (Section 7.3.3).
One of the drawbacks of the SA strategy is that in order to find
near-optimal solutions it needs very large computation times.
Such a strategy, although useful for the final stages of the
system synthesis, cannot be used inside a design space
exploration cycle.

Figure 7.6: Evaluation of the Modification
Cost Minimization Heuristics

MH, SA and AM have been used to map each of the 150
applications representing Γ_current on top of the existing
applications ψ. For each of the resulting designs, the objective
function C has been computed. Very long and expensive runs have
been performed with the SA algorithm for each process set, and the
best solution ever produced has been considered as the
near-optimum for that process set. We have compared the cost
function obtained for the 150 applications considering each of the
three mapping algorithms. Figure 7.7a presents the average
percentage deviation of the cost function obtained with MH and AM
from the value of the cost function obtained with the near-optimal
scheme. We have excluded from the results in Figure 7.7 the 28
solutions obtained with AM for which the second design criterion
for messages has not been met, and thus the objective function has
been strongly penalized. The average run-times of the algorithms,
in minutes, are presented in Figure 7.7b. The SA approach performs
best in terms of quality at the expense of a large execution time.
The execution time can be up to 40 minutes for large applications
of 320 processes. MH performs very well, and is able to obtain
good quality solutions in a very short time. AM is very fast, but
since it does not explicitly address the design criteria presented
in Section 7.2 it has the worst quality of solutions, according to
the cost function.
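
The quality measure used in this comparison can be written down
as a short sketch. The function names are hypothetical and the
numbers are made-up inputs that only show the arithmetic; they are
not measured results.

```python
def percentage_deviation(cost, near_optimal_cost):
    """Deviation of a heuristic's cost from the near-optimal (SA) cost, in %."""
    return 100.0 * (cost - near_optimal_cost) / near_optimal_cost


def average_deviation(costs, near_optimal_costs):
    """Average percentage deviation over a set of applications."""
    deviations = [percentage_deviation(c, ref)
                  for c, ref in zip(costs, near_optimal_costs)]
    return sum(deviations) / len(deviations)


# Made-up cost values for two applications of the same dimension.
print(average_deviation([110.0, 120.0], [100.0, 100.0]))  # 15.0
```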

The most important aspect of the experiments is determining
to what extent the mapping strategies proposed in this chapter
really facilitate the implementation of future applications. To
find this out, we have mapped applications of 40, 80, 160 and
240 processes representing the Γ_current application on top of the
previously generated existing applications ψ. After mapping and
scheduling each of these applications we have tried to add a new
application Γ_future to the resulting system. Γ_future consists of
80 processes, randomly generated according to the same
specifications presented before. The experiments have been
performed two times, using first MH*¹ and then AM for mapping
Γ_current. In both cases we were interested in whether it is
possible to find a valid implementation for Γ_future on top of
Γ_current using the initial mapping algorithm IMS. Figure 7.8a
shows the number of successful implementations in the two cases.
When Γ_current has been mapped with MH*, that is, using the design
criteria and metrics proposed in this chapter, we were able to
find a valid schedule for 56% of the total mapping attempts with
IMS using Γ_future. However, using AM to map Γ_current leads to a
situation where IMS is able to find valid schedules in only 31% of
the cases.

Another observation from Figure 7.8 is that when the available
utilization is large, as in the case where Γ_current has only 40
processes, it is easy for both MH* and AM to find a mapping that
allows adding future applications. However, as Γ_current grows to
80 processes, only MH* is able to find a mapping of Γ_current that
supports an incremental design process, accommodating more than
60% of the future applications, while with AM less than 25% are
accommodated. If the remaining utilization is very small, after
we map a Γ_current of 240 processes, it becomes practically
impossible to map new applications without modifying the current
system.

However, if the mapping heuristic is allowed to modify the
existing system as discussed in this chapter, then we are able to
increase the number of successfully mapped Γ_future applications
to 73% of the total instead of only 56%. The percentage of
accommodated Γ_future applications, for different dimensions of
Γ_current, if modifications are allowed on the existing system, is
shown by the diagram MH in Figure 7.8b. After mapping a Γ_current
with 80 processes using MH we are able to accommodate 88% of the
future applications, compared to only 61% in the case where we do
not allow modifications to the existing system (MH*). Such an
increase is, of course, expected. The important aspect, however,
is that it is obtained not by randomly selecting old applications
to be remapped, but by performing this selection such that the
total modification cost is minimized.

Figure 7.8: Percentage of Future Applications
Successfully Mapped

1. MH* is the same mapping heuristic as in Figure 7.4, but in which
we do not allow modifications to the existing applications.

7.5.3 The Vehicle Cruise Controller 

Finally, we considered an example implementing a vehicle 
cruise controller (CC): 

• The CC has 32 processes, and is modeled as an un-mapped
conditional process graph, presented in Figure 2.9 on
page 40.

• The cruise controller is to be mapped on an architecture con- 
sisting of 5 nodes, interconnected by TTP, as presented in 
Figure 2.7a on page 37. 

• The software architecture for event-triggered systems used 
by the CC is introduced in Section 3.4. 

The system ψ consists of 80 processes generated randomly.
The CC is the Γ_current application to be mapped. We have also
generated 30 future applications of 40 processes each with the
characteristics of the CC, which are typical for automotive
applications. By mapping the CC using MH* we were able to later
map 18 of the future applications, while using AM only 6 of the
future applications could be mapped. MH* and AM do not allow
modifications of the existing system. When modifications are
allowed, using the MH approach, we are able to map 26 of the 30
future applications.

As the experiments have shown, the design criteria proposed
in this chapter, for event-driven systems, are able to drive the
optimization process towards solutions that support an
incremental design process.

This and the previous part of the book have addressed ET and
TT systems, respectively. In the next part, we will consider
multi-cluster systems, designed as interconnected clusters of
processors, where each such cluster can be either TT or ET.

Chapter 8
Schedulability Analysis and
Bus Access Optimization for
Multi-Cluster Systems



This chapter presents an approach to schedulability analy- 
sis and bus access optimization for multi-cluster distributed 
embedded systems consisting of time-triggered and event-triggered 
clusters, interconnected via gateways, as introduced in Section 3.5. 

On the time-triggered clusters (TTC) the processes are sched- 
uled based on a non-preemptive static cyclic scheduling policy, 
and messages are sent using the TTP, while on the event-trig- 
gered clusters (ETC) we use a fixed-priority preemptive schedul- 
ing policy for processes, and messages are sent via the CAN bus. 

We have proposed a schedulability analysis for multi-cluster
systems, including a buffer size and worst case queuing delay
analysis for the gateways, responsible for routing inter-cluster
traffic. Optimization heuristics for the priority assignment and
synthesis of bus access parameters aimed at producing a
schedulable system with minimal buffer needs have also been
developed.

This chapter is organized in five sections. The next section
introduces the problems that we are addressing in this chapter.
Section 8.2 presents our proposed schedulability analysis for
multi-cluster systems, and Section 8.3 uses this analysis to
drive the optimization heuristics used for system synthesis. The
last section presents the experimental results.



8.1 Problem Formulation 

As input to our problem we have an application Γ given as a set
of conditional process graphs mapped on an architecture
consisting of a TTC and an ETC interconnected through a gateway
node. The set of nodes on the TTC is denoted with N_T, the ETC
consists of the set of nodes N_E, and the gateway node is denoted
with N_G.

We are interested first in finding a system configuration denoted
by a 3-tuple ψ = <φ, β, π> such that the application Γ is
schedulable. Determining a system configuration ψ means deciding
on:

• The set φ of the offsets corresponding to each process and
message in the system (see Section 6.2). The offsets of
processes and messages on the TTC practically represent the
local schedule tables and MEDLs.

• The TTC bus configuration β, indicating the sequence and
size of the slots in a TDMA round on the TTC.

• The priorities of the processes and messages on the ETC,
captured by π.

Once a configuration leading to a schedulable application is
found, we are interested in finding a system configuration that
minimizes the total queue sizes needed to run a schedulable
application. The approach presented in this chapter can be
extended to cluster configurations where there are several ETCs
and TTCs interconnected by gateways.

Example 8.1: Let us consider the example in Figure 8.1
where we have the application mapped on a two-cluster
system as illustrated in Figure 3.8 on page 60. In the system
configuration of Figure 8.1a we consider that, on the TTP bus,
the gateway transmits in the first slot (S_G) of the TDMA
round, while node N_1 transmits in the second slot (S_1). The
priorities inside the ETC have been set such that
priority_m1 > priority_m2 and priority_P3 > priority_P2.

In such a setting, the deadline, which was set at 200 ms, will
be missed. However, changing the system configuration as in
Figure 8.1b, so that slot S_1 of N_1 comes first, we are able to
send m_1 and m_2 sooner, and thus reduce the response time
and meet the deadline. The response times and resource
usage do not, of course, depend only on the TDMA
configuration. In Figure 8.1c, for example, we have modified the
priorities of P_2 and P_3 so that P_2 is the higher priority
process. In such a situation, P_2 is not interrupted when the
delivery of message m_2 was supposed to activate P_3 and, thus,
eliminating the interference, we are able to meet the deadline,
even with the TTP bus configuration of Figure 8.1a.

Figure 8.1: Scheduling Examples for Multi-Clusters



8.2 Multi-Cluster Scheduling 

In this section we propose an analysis for hard real-time
applications mapped on multi-cluster systems. The aim of such an
analysis is to find out if a system is schedulable, i.e., all the
timing constraints are met. In addition to this, we are also
interested in bounding the queue sizes needed to run a
schedulable application.

On the TTC, an application is schedulable if it is possible to
build a schedule table such that the timing requirements are
satisfied. On the ETC, the answer to whether or not a system is
schedulable is given by a schedulability analysis, and we use the
schedulability analysis outlined in Section 6.4.1.

In Section 6.4.1 the release jitter of a destination process D
depends on the communication delay between sending and
receiving an incoming message m: J_D = r_m. However, in the
case of a multi-cluster system, we will use offsets to capture the
communication delays, and not the release jitter. Thus, the offset
of a process will be determined such that it contains the
communication delay due to the incoming message. For example, the
offset O_2 of process P_2 in Figure 8.1a has been set such that it
accounts for the delay due to the message sent from the gateway
transfer process T to the process P_2 via the CAN bus.

Moreover, determining the schedulability of an application
mapped on a multi-cluster system cannot be addressed separately
for each type of cluster, since the inter-cluster communication
creates a circular dependency: the static schedules determined
for the TTC influence, through the offsets, the response times of
the processes on the ETC, which in turn influence the schedule
table construction on the TTC.

Example 8.2: In Figure 8.1a, placing m_1 and m_2 in the
same slot leads to equal offsets for P_2 and P_3. Because of this,
P_3 will interfere with P_2 (which would not be the case if m_2
sent to P_3 would be scheduled in Round 4) and thus the
placement of P_4 in the schedule table has to be accordingly
delayed to guarantee the arrival of m_3.

■ 

In our response time analysis we consider the influence 
between the two clusters by making the following observations: 

• The start time of process P_i in a schedule table on the TTC is
its offset O_i.

• The worst-case response time of a TT process is its worst-case
execution time, i.e., r_i = C_i (TT processes are not
preemptable).

• The worst-case response times of the messages exchanged
between two clusters have to be calculated according to the
schedulability analysis to be described in Section 8.2.1.

• The offsets have to be set by a scheduling algorithm such
that the precedence relationships are preserved. This means
that, if process P_j depends on process P_i, the following
condition must hold: O_j ≥ O_i + r_i. Note that for the processes
on a TTC receiving messages from the ETC this translates to
setting the start times of the processes such that a process is
not activated before the worst-case arrival time of the message
from the ETC. In general, offsets on the TTC are set such
that all the necessary messages are present at the process
invocation.
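
The precedence condition in the last observation can be checked
mechanically. The sketch below assumes offsets and worst-case
response times are stored in dictionaries keyed by process name
and that dependencies are (predecessor, successor) pairs; these
structures and the sample values are hypothetical.

```python
def precedence_violations(offsets, response_times, dependencies):
    """Return the (i, j) pairs for which O_j >= O_i + r_i does not hold."""
    return [(i, j) for i, j in dependencies
            if offsets[j] < offsets[i] + response_times[i]]


def earliest_offset(process, offsets, response_times, dependencies):
    """Smallest offset of `process` that respects all its predecessors."""
    finishes = [offsets[i] + response_times[i]
                for i, j in dependencies if j == process]
    return max(finishes, default=0)


offsets = {"P1": 0, "P2": 30, "P4": 50}
response_times = {"P1": 20, "P2": 25, "P4": 10}
dependencies = [("P1", "P2"), ("P2", "P4")]

print(precedence_violations(offsets, response_times, dependencies))
print(earliest_offset("P4", offsets, response_times, dependencies))
```

Here P4 starts before P2 is guaranteed to have finished, so its
offset would have to be delayed to the value reported by
`earliest_offset`.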

The MultiClusterScheduling algorithm in Figure 8.2 receives as
input the application Γ, the mapping M, and the system
configuration ψ, and produces the offsets φ and worst-case
response times ρ.

The algorithm initially sets all the offsets to 0 (line 2). Then,
the worst-case response times are calculated using the
ResponseTimeAnalysis function (line 5) using the feasible analysis
provided in [Tin94b]. The fixed-point iterations that calculate
the response times at line 5 will converge if processor and bus
loads are smaller than 100% [Tin94b]. Based on these worst-case
response times, we determine new values φ^new for the offsets
using a list scheduling algorithm (line 7).
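
The convergence remark can be illustrated with the classic
fixed-point recurrence for fixed-priority preemptive scheduling,
w = C_i + Σ ⌈(w + J_j) / T_j⌉ · C_j over the higher priority
processes. This is a simplified, single-processor sketch without
offsets or blocking, not the full analysis of [Tin94b]; the
function name, the iteration bound and the sample task set are
assumptions.

```python
import math


def response_time(C_i, higher_priority, J_i=0, limit=10_000):
    """Worst-case response time of a process with execution time C_i.

    higher_priority: list of (C_j, T_j, J_j) tuples for the higher priority
    processes on the same processor. Returns None if the iteration grows
    past `limit`, which happens when the load is too high to converge.
    """
    w = C_i
    while w <= limit:
        w_next = C_i + sum(math.ceil((w + J_j) / T_j) * C_j
                           for C_j, T_j, J_j in higher_priority)
        if w_next == w:
            return w + J_i  # fixed point reached
        w = w_next
    return None


print(response_time(3, [(1, 4, 0), (2, 6, 0)]))   # 10
print(response_time(1, [(5, 5, 0)], limit=1000))  # None
```

In the second call the higher priority process alone loads the
processor to 100%, so the recurrence never reaches a fixed point.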

The multi-cluster scheduling algorithm loops until the degree
of schedulability $\delta_\Gamma$ of the application $\Gamma$ cannot be further
reduced (lines 9-22). In each loop iteration, we select a new offset
$O_i^{new}$ from the set of $\phi^{new}$ offsets (line 10), and run the response
time analysis (line 12) to see if the degree of schedulability has
improved (line 13). The offset selected is the one that corresponds
to the unschedulable process $P_i$ (i.e., its worst-case response
time $r_i$ is greater than its deadline $D_i$) with the largest difference
$r_i - D_i$. If $\delta_\Gamma$ has not improved, we continue with the next
offset in $\phi^{new}$.

When a new offset leads to an improved $\delta_\Gamma$, we exit the
for-each loop 10-21 that examines offsets from $\phi^{new}$. The loop
iteration 9-22 continues with a new set of offsets, determined by
ListScheduling at line 16, based on the worst-case response times
$\rho^{new}$ corresponding to the previously accepted offset.

MultiClusterScheduling(Γ, M, ψ)
 1  -- determines the set of offsets φ and worst-case response times ρ
 2  for each O_i ∈ φ do O_i = 0 end for -- initially all offsets are zero
 3  -- determine initial values for the worst-case response times
 4  -- according to the analysis in Section 8.2.1
 5  ρ = ResponseTimeAnalysis(Γ, M, ψ, φ)
 6  -- determine new values for the offsets, based on ρ
 7  φ^new = ListScheduling(Γ, M, ψ, ρ)
 8  δ_Γ = ∞ -- consider the system unschedulable initially
 9  repeat -- iteratively improve the degree of schedulability δ_Γ
10    for each O_i^new ∈ φ^new do -- for each newly calculated offset
11      O_i^old = φ.O_i; φ.O_i = φ^new.O_i^new -- set the new offset, remember old
12      ρ^new = ResponseTimeAnalysis(Γ, M, ψ, φ)
13      δ_Γ^new = SchedulabilityDegree(Γ, ρ^new)
14      if δ_Γ^new < δ_Γ then -- the schedulability has improved
15        -- offsets are recalculated using ρ^new
16        φ^new = ListScheduling(Γ, M, ψ, ρ^new)
17        break -- exit the for-each loop
18      else -- the schedulability has not improved
19        φ.O_i = O_i^old -- restore the old offset
20      end if
21    end for
22  until δ_Γ has not changed or a limit is reached
23  return ρ, φ, δ_Γ
end MultiClusterScheduling

Figure 8.2: The MultiClusterScheduling Algorithm
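
The control flow of this loop can be sketched in Python as follows. This is only an illustration under stated assumptions: the three analysis steps are passed in as callables, the candidate offsets are tried in dictionary order rather than by largest deadline overrun, and the accepted value of the degree of schedulability is stored explicitly when an offset is kept. The toy functions and their numbers are hypothetical.

```python
import math


def multi_cluster_scheduling(names, response_time_analysis, list_scheduling,
                             schedulability_degree, max_iterations=100):
    """Accept a new offset only if it lowers the degree of schedulability."""
    offsets = {name: 0.0 for name in names}        # all offsets start at zero
    rho = response_time_analysis(offsets)          # initial response times
    new_offsets = list_scheduling(rho)             # candidate offsets
    delta = math.inf                               # unschedulable initially
    for _ in range(max_iterations):                # outer repeat loop
        improved = False
        for name, candidate in new_offsets.items():
            old = offsets[name]                    # remember the old offset
            offsets[name] = candidate              # try the new one
            rho_new = response_time_analysis(offsets)
            delta_new = schedulability_degree(rho_new)
            if delta_new < delta:                  # improvement: keep it
                delta, rho = delta_new, rho_new
                new_offsets = list_scheduling(rho_new)
                improved = True
                break
            offsets[name] = old                    # no gain: restore
        if not improved:
            break
    return rho, offsets, delta


# Hypothetical toy system: B suffers 2 extra time units if released before
# A has finished, and B has an assumed deadline of 4 on its response time.
def toy_rta(offsets):
    return {"A": 2.0, "B": 3.0 + (2.0 if offsets["B"] < 2.0 else 0.0)}


def toy_list_scheduling(rho):
    return {"A": 0.0, "B": rho["A"]}


def toy_degree(rho):
    return max(0.0, rho["B"] - 4.0)


rho, offsets, delta = multi_cluster_scheduling(
    ["A", "B"], toy_rta, toy_list_scheduling, toy_degree)
```

In the toy run, delaying B until A has finished removes the interference, so the loop ends with an offset of 2 for B and a degree of schedulability of 0.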

In the multi-cluster scheduling algorithm, the calculation of
offsets is performed by the list scheduling algorithm presented
in Figure 8.3. In each iteration, the algorithm visits the processes
and messages in the ReadyList. A process or a message in
the application is placed in the ReadyList if all its predecessors
have already been scheduled. The list is ordered based on the
priorities presented in Section 4.3.2. The algorithm terminates when
all processes and messages have been visited.

In each loop iteration, the algorithm calculates the earliest
time moment (denoted with the variable offset) when the process
or message $node_i$ in the application graph $\Gamma$ can start (lines 5-7).
There are four situations:

1. The visited node in the application graph is an ET message.
In this case, the offset of message $m_i$ is updated to offset
(line 10).

2. The node is a TT message. In this case, the message is scheduled
using the ScheduleMessage function from Section 4.3.1,
which returns the round and the slot where the frame has
been placed (line 12 in Figure 8.3). Once the message has
been scheduled, we can determine its offset and worst-case
response time (Figure 8.3, line 14). Thus, the offset is equal
to the start of the slot in the TDMA round, and the worst-case
response time is the slot length.

3. The algorithm visits a process $P_i$ mapped on an ETC node. A
process on the ETC can start as soon as its predecessors have
finished and its inputs have arrived, hence $O_i$ = offset (line
19). However, $P_i$ might, later on, experience interference
from higher priority processes.

4. Process $P_i$ is mapped on a TTC node. In this case, besides
waiting for the predecessors to finish executing, $P_i$ will also
have to wait for its processor $M(P_i)$ to become available
(line 22). The earliest time when the processor is available is
returned by the ProcessorAvailable function.

ListScheduling(Γ, M, ψ, ρ) -- determines the set of offsets φ
 1  ReadyList = source nodes of all process graphs in the application
 2  while ReadyList ≠ ∅ do
 3    node_i = Head(ReadyList)
 4    offset = 0 -- determine the earliest time when an activity can start
 5    for each direct predecessor node_j of node_i do
 6      offset = max(offset, O_j + r_j)
 7    end for
 8    if node_i is a message m_i then
 9      if m_i is an ET message then
10        O_i = offset -- update the message offset
11      else -- m_i is a TT message
12        <round, slot> = ScheduleMessage(offset, s_mi, M(S(m_i)))
13        -- set the TT message offset based on the round and slot
14        O_i = round * T_TDMA + O_slot
15      end if
16
17    else -- node_i is a process P_i
18      if M(P_i) ∈ N_E then -- process P_i is mapped on the ETC
19        O_i = offset -- the ETC process can start immediately
20      else -- process P_i is mapped on the TTC
21        -- P_i has to wait for processor M(P_i) to become available
22        O_i = max(offset, ProcessorAvailable(M(P_i)))
23      end if
24    end if
25    Update(ReadyList)
26  end while
27  return offsets φ
end ListScheduling

Figure 8.3: ListScheduling Algorithm
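
The four situations can be illustrated with a simplified Python sketch. It is not the algorithm of Figure 8.3 itself: the ready list is served in first-in-first-out order instead of the priorities of Section 4.3.2, the placement of a TT message only looks for the first occurrence of a fixed slot and ignores slot capacity, and all names, data layouts and numbers are assumptions made for the example.

```python
import math
from collections import deque


def list_scheduling(nodes, preds, resp, t_tdma):
    """Compute an offset for every process and message of the graph."""
    succs = {n: [] for n in nodes}
    missing = {n: len(preds.get(n, [])) for n in nodes}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    ready = deque(n for n in nodes if missing[n] == 0)   # source nodes
    offsets = {}
    available = {}       # TTC processor -> earliest time it is free
    while ready:
        n = ready.popleft()
        # earliest start: all predecessors finished in the worst case
        offset = max((offsets[p] + resp[p] for p in preds.get(n, [])),
                     default=0)
        info = nodes[n]
        if info["kind"] in ("et_msg", "et_proc"):
            offsets[n] = offset                          # situations 1 and 3
        elif info["kind"] == "tt_msg":                   # situation 2
            rounds = max(0, math.ceil((offset - info["slot_start"]) / t_tdma))
            offsets[n] = rounds * t_tdma + info["slot_start"]
        else:                                            # situation 4
            start = max(offset, available.get(info["proc"], 0))
            offsets[n] = start
            available[info["proc"]] = start + resp[n]
        for s in succs[n]:                               # update ready list
            missing[s] -= 1
            if missing[s] == 0:
                ready.append(s)
    return offsets


# Hypothetical graph: P1 -> m1 -> P2, plus an independent P3 sharing N1.
nodes = {
    "P1": {"kind": "tt_proc", "proc": "N1"},
    "m1": {"kind": "tt_msg", "slot_start": 10},
    "P2": {"kind": "et_proc"},
    "P3": {"kind": "tt_proc", "proc": "N1"},
}
preds = {"m1": ["P1"], "P2": ["m1"]}
resp = {"P1": 20, "m1": 10, "P2": 30, "P3": 15}
result = list_scheduling(nodes, preds, resp, t_tdma=40)
```

Here P3 has to wait until N1 is released by P1, and m1 misses the slot of the first round, so it is placed in the slot of the second round at time 50.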

Let us now turn our attention back to the multi-cluster scheduling
algorithm in Figure 8.2. The algorithm stops when the $\delta_\Gamma$
of the application $\Gamma$ is no longer improved, or when a limit
imposed on the number of iterations has been reached. Since in
a loop iteration we do not accept a solution with a larger $\delta_\Gamma$, the
algorithm will terminate when in a loop iteration we are no
longer able to improve $\delta_\Gamma$ by modifying the offsets.

8.2.1 SCHEDULABILITY AND RESOURCE ANALYSIS 

The analysis in this section is used in the ResponseTimeAnalysis
function in order to determine the response times for processes
and messages on the ETC. It receives as input the application $\Gamma$,
the offsets $\phi$ and the priorities $\pi$, and it produces the set $\rho$ of
worst-case response times.

We have used the response time analysis outlined in
Section 6.4.1 for the CAN bus (Equations 6.6, 6.9, 6.10, and 6.11).
However, the worst-case queuing delay for a message
(Equation 6.9) is calculated differently depending on the type of
message passing employed:

1. From an ETC node to another ETC node (in which case $w_m^{N_i}$
represents the worst-case time a message $m$ has to spend in
the $Out_{N_i}$ queue on ETC node $N_i$). Figure 8.1 contains an
example of such a message, sent from an ETC node to the
gateway node $N_G$.

2. From a TTC node to an ETC node ($w_m^{CAN}$ is the worst-case time
a message $m$ has to spend in the $Out_{CAN}$ queue). In
Figure 8.1, message $m_1$ is sent from the TTC node $N_1$ to the
ETC node $N_2$.

3. From an ETC node to a TTC node (where $w_m^{TTP}$ captures the
time $m$ has to spend in the $Out_{TTP}$ queue). Such a message
passing happens in Figure 8.1, where a message is sent
from an ETC node to the TTC node $N_1$ through the
gateway node $N_G$, where it has to wait for a time $w_m^{TTP}$ in the
$Out_{TTP}$ queue.

The messages sent from a TTC node to another TTC node are
taken into account when determining the offsets (ListScheduling,
Figure 8.3), and thus are not involved directly in the ETC analysis.

The next sections show how the worst-case queuing delays and the
bounds on the queue sizes are calculated for each of the previous
three cases.

From ETC to ETC and from TTC to ETC 

The analyses for $w_m^{N_i}$ and $w_m^{CAN}$ are similar. Once $m$ is the
highest priority message in the $Out_{CAN}$ queue, it will be sent by the
gateway's CAN controller as a regular CAN message; therefore the
same equation for $w_m$ can be used, where $q$ is the number of busy
periods being examined, and $w_m(q)$ is the width of the level-$m$
busy period starting at time $qT_m$:

$$w_m(q) = B_m + \sum_{\forall m_j \in hp(m)} \left\lceil \frac{w_m(q) + J_j}{T_j} \right\rceil C_j \qquad (8.2)$$

The intuition is that $m$ has to wait, in the worst case, first for
the largest lower priority message that is just being transmitted
($B_m$) as well as for the higher priority $m_j \in hp(m)$ messages that
have to be transmitted ahead of $m$ (the second term). In the
worst case, the time it takes for the largest lower priority
message $m_k \in lp(m)$ to be transmitted to its destination is:

$$B_m = \max_{\forall m_k \in lp(m)} (C_k) \qquad (8.3)$$

Note that in our case, $lp(m)$ and $hp(m)$ also include messages
produced by the gateway node, transferred from the TTC to the ETC.
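
A busy-period width of this kind is normally obtained by fixed-point iteration. The Python sketch below assumes the recurrence has the usual form w = B_m + sum over higher-priority messages of ceil((w + J_j) / T_j) * C_j, with B_m the longest transmission time among lower-priority messages. The field names, the convention that a smaller number means a higher priority, and the message set are assumptions made for this example.

```python
import math


def blocking_time(m, messages):
    """Longest transmission time among the messages of lower priority."""
    lower = [x["C"] for x in messages if x["prio"] > m["prio"]]
    return max(lower, default=0)


def busy_period(m, messages, limit=10_000):
    """Iterate w = B + sum(ceil((w + J_j) / T_j) * C_j) to a fixed point."""
    higher = [x for x in messages if x["prio"] < m["prio"]]
    b = blocking_time(m, messages)
    # Start from one instance of every higher-priority message.
    w = b + sum(x["C"] for x in higher)
    while w <= limit:
        w_next = b + sum(
            math.ceil((w + x["J"]) / x["T"]) * x["C"] for x in higher)
        if w_next == w:
            return w
        w = w_next
    raise ValueError("no fixed point below the limit (bus overloaded?)")


# Hypothetical message set; a smaller prio value means higher priority.
msg_a = {"prio": 1, "C": 2, "T": 10, "J": 0}
msg_b = {"prio": 2, "C": 3, "T": 20, "J": 0}
msg_c = {"prio": 3, "C": 4, "T": 50, "J": 0}
messages = [msg_a, msg_b, msg_c]

w_a = busy_period(msg_a, messages)
w_b = busy_period(msg_b, messages)
w_c = busy_period(msg_c, messages)
```

For message b, the blocking term is the transmission time of c (4), and one instance of a (2) fits into the resulting window, so the iteration settles at 6.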

We are also interested in bounding the size $s_{Out}^{CAN}$ of the
$Out_{CAN}$ queue and the size $s_{Out}^{N_i}$ of the $Out_{N_i}$ queue.
In the worst case, message $m$ and all the
messages with higher priority than $m$ will be in the queue,
awaiting transmission. Summing up their sizes, and finding
the most critical instant, we get the worst-case queue size:

$$s_{Out} = \max_{\forall m} \left( s_m + \sum_{\forall m_j \in hp(m)} \left\lceil \frac{w_m + J_j}{T_j} \right\rceil s_j \right) \qquad (8.4)$$

where $s_m$ and $s_j$ are the sizes of message $m$ and $m_j$, respectively.



From ETC to TTC 

The time a message $m$ has to spend in the $Out_{TTP}$ queue in the
worst case depends on the total size of messages queued ahead
of $m$ ($Out_{TTP}$ is a FIFO queue), the size $S_G$ of the gateway slot
responsible for carrying the CAN messages on the TTP bus, and
the period $T_{TDMA}$ with which this slot is circulating on the
bus. Thus, the width of the level-$m$ busy period starting at
time $qT_m$ is:

$$w_m^{TTP}(q) = B_m + \left\lceil \frac{(q+1)s_m + I_m(w_m(q))}{S_G} \right\rceil T_{TDMA} \qquad (8.5)$$

where $I_m$ is the total size of the messages queued ahead of $m$.
The messages ahead of $m$ are those $m_j \in hp(m)$ which have been
sent from the ETC to the TTC and have higher priority than $m$:

$$I_m(w) = \sum_{\forall m_j \in hp(m)} \left\lceil \frac{w + J_j}{T_j} \right\rceil s_j \qquad (8.6)$$

where the message jitter $J_j$ is, in the worst case, the response
time of the sender process.

The blocking term $B_m$ is the time interval in which $m$ cannot
be transmitted because the slot $S_G$ of the TDMA round has not
arrived yet. In the worst case (i.e., the message $m$ has just
missed the slot $S_G$), the frame has to wait an entire round
$T_{TDMA}$ for the slot $S_G$ in the next TDMA round.
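
The following Python sketch illustrates how such a delay could be iterated numerically. It assumes the recurrence w = B_m + ceil(((q + 1) * s_m + I_m(w)) / S_G) * T_TDMA, with I_m(w) the number of bytes of higher-priority messages arriving within w and with B_m set to one full TDMA round. The rounding by a ceiling, the field names and the numbers are assumptions made for the example.

```python
import math


def bytes_ahead(w, higher):
    """Bytes of higher-priority messages that can arrive within a window w."""
    return sum(math.ceil((w + x["J"]) / x["T"]) * x["s"] for x in higher)


def ttp_queue_delay(s_m, higher, s_gateway, t_tdma, q=0, limit=100_000):
    """Fixed-point iteration for the worst-case time spent in the queue."""
    blocking = t_tdma          # worst case: the gateway slot was just missed
    w = blocking
    while w <= limit:
        size = (q + 1) * s_m + bytes_ahead(w, higher)
        w_next = blocking + math.ceil(size / s_gateway) * t_tdma
        if w_next == w:
            return w
        w = w_next
    raise ValueError("no fixed point below the limit")


# Hypothetical numbers: an 8-byte message behind one 16-byte message that
# arrives every 100 time units; the gateway slot carries 16 bytes per round.
delay = ttp_queue_delay(
    s_m=8,
    higher=[{"s": 16, "T": 100, "J": 0}],
    s_gateway=16,
    t_tdma=40,
)
```

With these values the iteration passes through 40, 120 and 160 and then stabilizes, because two instances of the larger message plus the message itself need three gateway slots.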

Determining the size of the queue needed to accommodate the
worst-case burst of messages sent from the CAN cluster is done
by finding the worst instant of the following sum:

$$s_{Out}^{TTP} = \max_{\forall m} (s_m + I_m) \qquad (8.7)$$



8.3 Scheduling and Optimization Strategy 

Once we have a technique to determine if a system is schedulable,
we can concentrate on optimizing the total queue sizes. Our
problem is to synthesize a system configuration $\psi$ such that the
application is schedulable, i.e., the condition¹

$$r_{G_i} \le D_{G_i}, \quad \forall G_i \in \Gamma \qquad (8.8)$$

holds, and the total queue size $s_{total}$ is minimized²:

$$s_{total} = s_{Out}^{CAN} + s_{Out}^{TTP} + \sum_{\forall N_i \in ETC} s_{Out}^{N_i} \qquad (8.9)$$
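
The two ingredients of the optimization problem, the schedulability test and the cost function, can be written down in a few lines of Python. The sketch assumes that the response time of a process graph is the offset plus the worst-case response time of its sink node, and that the total queue size is the sum of the gateway queues and the output queues of the ETC nodes. The dictionary keys and the numbers are hypothetical.

```python
def is_schedulable(graphs):
    """Every process graph must finish by its deadline in the worst case."""
    return all(
        g["sink_offset"] + g["sink_response"] <= g["deadline"] for g in graphs
    )


def total_queue_size(s_out_can, s_out_ttp, s_out_nodes):
    """Sum of the two gateway queues and the ETC node output queues."""
    return s_out_can + s_out_ttp + sum(s_out_nodes.values())


# Illustrative values only.
graphs = [
    {"name": "G1", "sink_offset": 120, "sink_response": 40, "deadline": 200},
    {"name": "G2", "sink_offset": 90, "sink_response": 30, "deadline": 100},
]
schedulable_all = is_schedulable(graphs)        # G2 finishes at 120 > 100
schedulable_g1 = is_schedulable(graphs[:1])     # G1 finishes at 160 <= 200
s_total = total_queue_size(64, 48, {"N2": 32, "N3": 16})
```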



In the next section, we propose a resource optimization strat- 
egy based on a hill-climb heuristic that uses an intelligent set of 
initial solutions in order to efficiently explore the design space. 



8.3.1 Scheduling and Buffer Optimization Heuristic 

The basic idea of our buffer optimization heuristic is to find, as a
first step, a solution with the smallest possible response times,
without considering the buffer sizes, in the hope of finding a
schedulable system. This is achieved through the
OptimizeSchedule function, outlined in Figure 8.4. Then, a
hill-climbing heuristic [Ree93] iteratively performs moves intended
to minimize the total buffer size while keeping the resulting
system schedulable.

The OptimizeSchedule function is a greedy approach which
determines an ordering of the slots and their lengths, as well as
priorities of messages and processes in the ETC, such that the
schedulability of the application is maximized, i.e., the degree of
schedulability $\delta_\Gamma$ (see Section 6.6.1) is minimized.

As an initial TTC bus configuration $\beta$, OptimizeSchedule assigns
nodes to the slots in order and fixes the slot length to the
minimal allowed value, which is equal to the length of the
largest message generated by a process assigned to $N_i$,
$S_i = \langle N_i, size_{smallest} \rangle$ (line 5 in Figure 8.4). Then, the
algorithm starts with the first slot (line 8) and tries to find the
node which, when transmitting in this slot, will lead to the best
degree of schedulability $\delta_\Gamma$ (lines 9-37).

1. The worst-case response time of a process graph $G_i$ is calculated based
on its sink node as $r_{G_i} = O_{sink} + r_{sink}$. If local deadlines are imposed,
they will also have to be tested in the schedulability condition.

2. On the TTC, the synchronization between processes and the TDMA bus
configuration is solved through the proper synthesis of schedule tables,
thus no output queues are needed. Input buffers on both TTC and ETC
nodes are local to processes. There is one buffer per input message and
each buffer can store one message instance (see explanation to
Figure 3.8 on page 60).

Simultaneously with searching for the right node to be
assigned to the slot, the algorithm looks for the optimal slot
length (lines 14-32). Once a node is selected for the first slot and
a slot length fixed ($S_i = S_{best}$, line 36), the algorithm continues
with the next slots, trying to assign nodes (and to fix slot
lengths) from those nodes which have not yet been assigned.

When calculating the length of a certain slot we consider the 
feedback from the MultiClusterScheduling algorithm which recom- 
mends slot sizes to be tried out. Before starting the actual opti- 
mization process for the bus access scheme, a scheduling of the 
initial solution is performed which generates the recommended 
slot lengths. We refer the reader to Section 4.4.1 for details con- 
cerning the generation of the recommended slot lengths. 

In the OptimizeSchedule function the degree of schedulability $\delta_\Gamma$
is calculated based on the response times produced by the 
MultiClusterScheduling algorithm (line 21). For the priorities used 
in the response time calculation we use the “heuristic optimized 
priority assignment” (HOPA) approach (line 16) from [Gut95], 
where priorities for processes and messages in a distributed 
real-time system are determined, using knowledge of the factors 
that influence the timing behavior, such that the degree of 
schedulability is improved. 

The OptimizeSchedule function also records the best solutions in
terms of $\delta_\Gamma$ and $s_{total}$ in the seed_solutions list, to be used
as the starting point for the second step of our OptimizeResources
heuristic.

OptimizeSchedule(Γ, M)
 1  -- given an application Γ produces the configuration ψ = <φ, β, π>
 2  -- leading to the smallest δ_Γ
 3
 4  -- start by determining an initial TTC bus configuration β
 5  for each slot S_i ∈ β do S_i = <N_i, size_smallest> end for
 6
 7  -- find the best allocation of slots, the TDMA slot sequence
 8  for each slot S_i ∈ β do
 9    for each node N_j ∈ TTC do
10      -- allocate N_j tentatively to S_i, N_i gets slot S_j
11      S_i = <N_j, size_Sj>
12      S_j = <N_i, size_Si>
13      -- determine best size for slot S_i
14      for each slot size ∈ recommended_lengths(S_i) do
15        -- calculate the priorities according to HOPA heuristic
16        π = HOPA
17        -- determine the offsets φ,
18        -- thus obtaining a complete system configuration ψ
19        S_i = <N_j, size>
20        ψ_current = <φ, β, π>
21        φ = MultiClusterScheduling(Γ, M, ψ_current)
22        -- remember the best configuration so far,
23        -- add it to the seed configurations
24        if δ_Γ(ψ_current) is best so far then
25          ψ_best = ψ_current
26          S_best = S_i
27          add ψ_best to seed_solutions
28        end if
29        determine s_total for ψ_current
30        if s_total is best so far and Γ is schedulable
31        then add ψ_current to seed_solutions end if
32      end for
33    end for
34    -- make binding permanent, use the S_best corresponding to ψ_best
35    if a S_best exists
36    then S_i = S_best end if
37  end for
38
39  return ψ_best, δ_Γ(ψ_best), seed_solutions
end OptimizeSchedule

Figure 8.4: The OptimizeSchedule Algorithm

In the first step of our buffer size optimization heuristic
OptimizeResources, outlined in Figure 8.5, we have tried to obtain
a bus configuration that improves the degree of schedulability of
the application. Once a schedulable system is obtained, our goal
in the second step is to minimize the buffer space. Our design
space exploration in the second step of OptimizeResources (lines
12-22) is based on successive design transformations (generating
the neighbors of a solution) called moves. For our heuristics,
we consider the following types of moves:

• moving a process or a message belonging to the TTC inside its
[ASAP, ALAP] interval calculated based on the current values
for the offsets and response times;

• swapping the priorities of two messages transmitted on the
ETC, or of two processes mapped on the ETC;

• increasing or decreasing the size of a TDMA slot by a certain
value;

• swapping two slots inside a TDMA round.

OptimizeResources(Γ)
 1
 2  -- Step 1: try to find a schedulable system
 3  seed_solutions = OptimizeSchedule(Γ, M)
 4  -- if no schedulable configuration has been found,
 5  -- modify mapping and/or architecture
 6  if Γ is not schedulable for ψ_best then
 7    modify mapping
 8    go to Step 1
 9  end if
10
11
12  -- Step 2: try to reduce the resource need, minimize s_total
13  for each ψ in seed_solutions do
14    repeat
15      -- find moves with highest potential to minimize s_total
16      move_set = GenerateNeighbors(ψ)
17      -- select move which minimizes s_total
18      -- and does not result in an un-schedulable system
19      move = SelectMove(move_set)
20      Perform(move)
21    until s_total has not changed or limit reached
22  end for
23
24  return system configuration ψ, queue sizes
end OptimizeResources

Figure 8.5: The OptimizeResources Algorithm

The second step of the OptimizeResources heuristic starts from
the seed solutions (line 13) produced in the previous step, and
iteratively performs moves in order to reduce the total buffer
size $s_{total}$ (Equation 8.9). The heuristic tries to improve on the
total queue sizes, without producing un-schedulable systems.
The neighbors of the current solution are generated in the
GenerateNeighbors function (line 16), and the move with the
smallest $s_{total}$ is selected using the SelectMove function (line 19).
Finally, the move is performed, and the loop reiterates. The
iterative process ends when there is no improvement achieved
on $s_{total}$, or a limit imposed on the number of iterations has been
reached (line 21).
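
The structure of this second step can be sketched as a seeded hill climb. The Python code below is a generic illustration, not the heuristic's actual move set: a configuration is reduced to a tuple of slot sizes, a move changes one size by one unit, the cost is the sum of the sizes, and the schedulability test is a hypothetical lower bound on each slot.

```python
def hill_climb(seeds, neighbors, cost, is_schedulable, max_steps=1000):
    """Run a greedy descent from every seed and keep the cheapest result."""
    best = None
    for seed in seeds:
        current = seed
        for _ in range(max_steps):
            candidates = [n for n in neighbors(current) if is_schedulable(n)]
            if not candidates:
                break
            move = min(candidates, key=cost)      # most promising move
            if cost(move) >= cost(current):       # no improvement: stop
                break
            current = move                        # perform the move
        if best is None or cost(current) < cost(best):
            best = current
    return best


def slot_neighbors(cfg):
    """All configurations that differ by one unit in exactly one slot."""
    result = []
    for i in range(len(cfg)):
        for step in (-1, 1):
            value = cfg[i] + step
            if value >= 1:
                result.append(cfg[:i] + (value,) + cfg[i + 1:])
    return result


def toy_schedulable(cfg):
    # Hypothetical constraint: each slot must fit its largest message.
    return cfg[0] >= 4 and cfg[1] >= 2


best_cfg = hill_climb(
    seeds=[(8, 8), (5, 3)],
    neighbors=slot_neighbors,
    cost=sum,
    is_schedulable=toy_schedulable,
)
```

Moves that would make the system un-schedulable are filtered out before the selection, which mirrors the requirement that every accepted move keeps the application schedulable.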

The general limitation of a hill-climbing heuristic is that it
can get stuck in a local optimum. In order to improve the
chances of finding good values for $s_{total}$, the algorithm has to be
executed several times, starting with a different initial solution.
The intelligence of our OptimizeResources heuristic lies in the
selection of the initial solutions, recorded in the seed_solutions
list. The list is generated by the OptimizeSchedule function, which
records the best solutions in terms of $\delta_\Gamma$ and $s_{total}$.

Seeding the hill-climbing heuristic with several solutions of
small $s_{total}$ will guarantee that the local optima are quickly
found. However, during our experiments, we have observed that
another good set of seed solutions are those that have a high
degree of schedulability $\delta_\Gamma$. Starting from a highly schedulable
system will permit more iterations until the system degrades to
an un-schedulable configuration, thus the exploration of the
design space is more efficient.

8.4 Experimental Evaluation

To evaluate our algorithms, we first used applications generated
for experimental purposes. We considered two-cluster
architectures consisting of 2, 4, 6, 8 and 10 nodes, half on the TTC 
and the other half on the ETC, interconnected by a gateway. 
Forty processes were assigned to each node, resulting in applica- 
tions of 80, 160, 240, 320 and 400 processes. Message sizes were 
randomly chosen between 8 and 32 bytes. Thirty examples were 
generated for each application dimension, thus a total of 150 
applications were used for experimental evaluation. Worst-case 
execution times and message lengths were assigned randomly 
using both uniform and exponential distribution. All experi- 
ments were run on a SUN Ultra 10. 

In order to provide a basis for the evaluation of our heuristics
we have developed two simulated annealing (SA) based algorithms
(see Appendix A). Both are based on the moves presented
in the previous section. The first one, named SA Schedule (SAS),
was set to perform moves such that $\delta_\Gamma$ is minimized. The second
one, SA Resources (SAR), uses $s_{total}$ as the cost function to be
minimized. Very long and expensive runs have been performed with
each of the SA algorithms, and the best solution ever produced
has been considered as close to the optimum value.

8.4.1 Scheduling and Bus Access Optimization Heuristics 

The first experimental result concerns the ability of our heuristics
to produce schedulable solutions. We have compared the
degree of schedulability $\delta_\Gamma$ obtained from our OptimizeSchedule
(OS) heuristic (Figure 8.4) with the near-optimal values obtained
by SAS. Figure 8.6 presents the average percentage deviation of
the degree of schedulability produced by OS from the near-optimal
values obtained with SAS. Together with OS, a straightforward
approach (SF) is presented. For SF we considered a TTC bus
configuration consisting of a straightforward ascending order of
allocation of the nodes to the TDMA slots; the slot lengths were
selected to accommodate the largest message sent by the respective
node, and the scheduling has been performed by the
MultiClusterScheduling algorithm in Figure 8.2.

Figure 8.6 shows that when considering the optimization of 
the access to the communication channel, and of priorities, the 
degree of schedulability improves dramatically compared to the 
straightforward approach. The greedy heuristic OptimizeSchedule 
performs well for all the dimensions, having run-times which 
are more than two orders of magnitude smaller than with SAS. 
In the figure, only the examples where all the algorithms have 
obtained schedulable systems were presented. The SF approach 
failed to find a schedulable system in 26 out of the total 150 
applications. 




Figure 8.6: Comparison of the Scheduling Optimization Heuristics

8.4.2 Buffer Optimization Heuristic

Next, we are interested in evaluating the heuristics for minimizing
the buffer sizes needed to run a schedulable application.
Thus, we compare the total buffer need obtained by the
OptimizeResources (OR) function with the near-optimal values
obtained when using simulated annealing, this time with the
cost function $s_{total}$. To find out how relevant the buffer
optimization problem is, we have compared these results with the
$s_{total}$ obtained by the OS approach, which aims only at obtaining a
schedulable system, without any other concern. As shown in
Figure 8.7a, OR is able to find schedulable systems with a buffer
need half of that needed by the solutions produced with OS. The
quality of the solutions obtained by OR is also comparable with
the one obtained with simulated annealing (SAR).

Another important aspect of our experiments was to deter- 
mine the difficulty of resource minimization as the number of 
messages exchanged over the gateway increases. For this, we 
have generated applications of 160 processes with 10, 20, 30, 40, 
and 50 messages exchanged between the TTC and ETC clusters. 
Thirty applications were generated for each number of messages.
Figure 8.7b shows the average percentage deviation of
the buffer sizes obtained with OR and OS from the near-optimal
results obtained by SAR. As the number of inter-cluster messages
increases, the problem becomes more complex. The OS approach
degrades very fast, in terms of buffer sizes, while OR is able to
find good quality results even for intense inter-cluster traffic.

When deciding on which heuristic to use for design space
exploration or system synthesis, an important issue is the
execution time. On average, our optimization heuristics needed a
couple of minutes to produce results, while the simulated annealing
approaches (SAS and SAR) had an execution time of up to three
hours.

Figure 8.7: Comparison of the Buffer Size Minimization Heuristics
a) Bounds on total buffer size obtained with OS, OR, SAR
b) Average percentage deviation of OS, OR from SAR

8.4.3 The Vehicle Cruise Controller

Finally, we considered a real-life example implementing a vehicle
cruise controller (CC) introduced in Section 2.3.3:

• The conditional process graph that models the cruise controller
has 32 processes, and is presented in Figure 2.9 on
page 40.

• It was mapped on an architecture consisting of a TTC and
an ETC, each with 2 nodes, interconnected by a gateway, as in
Figure 2.7b on page 37.

• The software architecture for multi-cluster systems, used by 
the CC, is presented in Section 3.5. 

• We considered one mode of operation with a deadline of 250 
ms. 

The straightforward approach SF produced an end-to-end
response time of 320 ms, greater than the deadline, while both
the OS and SAS heuristics produced a schedulable system with a
worst-case response time of 185 ms. The total buffer need of the
solution determined by OS was 1020 bytes. After optimization
with OR, a still schedulable solution with a buffer need reduced
by 24% has been generated, which is only 6% worse than the
solution produced with SAR.

As a conclusion, the optimization heuristics proposed are able 
to increase the schedulability of the applications and reduce the 
buffer size needed to run a schedulable application. 

In this chapter, the main contribution was the development of
a schedulability analysis for multi-cluster systems. However, in
the case of both TTP and CAN protocols, several messages share
one frame, in the hope of utilizing resources more efficiently.
Therefore, in the next chapter we propose optimization heuristics
for determining frame packing configurations that are able
to reduce the cost of the resources needed to run a schedulable
application.

Partitioning and Mapping for Multi-Cluster Systems



Using the analysis proposed in the previous chapter, sev- 
eral design optimization problems can be addressed. In the 
remaining part of the book we will address problems which are 
characteristic of applications distributed across multi-cluster
systems consisting of heterogeneous TT and ET networks. In this 
chapter, we are interested in the partitioning of the processes of 
an application into time-triggered and event-triggered domains, 
and their mapping to the nodes of the clusters. The goal is to pro- 
duce an implementation which meets all the timing constraints 
of the application. 

The next section presents the design optimization problems
we are addressing in this chapter, and Section 9.2 presents our
proposed heuristics for the design optimization of multi-cluster
systems. The last section of the chapter presents the evaluation
of the heuristics.

9.1 Partitioning and Mapping

In this chapter, by partitioning we denote the decision whether a 
certain process should be assigned to the TT or the ET domain 
(and, implicitly, to a TTC or an ETC, respectively). Mapping a pro- 
cess means assigning it to a particular node inside a cluster. 

Very often, the partitioning decision is taken based on the 
experience and preferences of the designer, considering aspects 
like the functionality implemented by the process, the hardness 
of the constraints, sensitivity to jitter, legacy constraints, etc. 
Let $\mathcal{P}$ be the set of processes in the application $\Gamma$. We denote with
$\mathcal{P}_T \subseteq \mathcal{P}$ the subset of processes which the designer has assigned
to the TT cluster, while $\mathcal{P}_E \subseteq \mathcal{P}$ contains processes which are
assigned to the ET cluster.

Many processes, however, do not exhibit certain particular
features or requirements which obviously lead to their
implementation as TT or ET activities. The subset
$\mathcal{P}^+ = \mathcal{P} \setminus (\mathcal{P}_T \cup \mathcal{P}_E)$ of
processes could be assigned to any of the TT or ET domains.
Decisions concerning the partitioning of this set of activities can lead
to various trade-offs concerning, for example, the schedulability 
properties of the system, the amount of communication 
exchanged through the gateway, the size of the schedule tables, 
etc. 

For part of the partitioned processes, the designer might have
already decided their mapping. For example, certain processes,
due to constraints like having to be close to sensors/actuators,
have to be physically located in a particular hardware unit. They
represent the sets $\mathcal{P}_T^M \subseteq \mathcal{P}_T$ and $\mathcal{P}_E^M \subseteq \mathcal{P}_E$ of already mapped TT
and ET processes, respectively. Consequently, we denote with
$\mathcal{P}_T^* = \mathcal{P}_T \setminus \mathcal{P}_T^M$ the TT processes for which the mapping has not yet
been decided, and similarly, with $\mathcal{P}_E^* = \mathcal{P}_E \setminus \mathcal{P}_E^M$ the unmapped ET
processes. The set $\mathcal{P}^* = \mathcal{P}_T^* \cup \mathcal{P}_E^* \cup \mathcal{P}^+$ then represents all the
unmapped processes in the application.
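
The relations between these sets can be checked with plain set operations. In the Python sketch below, the set of unpartitioned processes is the full process set minus the TT and ET subsets, and the unmapped sets are obtained by removing the already mapped processes. The process names and the memberships are hypothetical.

```python
# Hypothetical application with six processes.
P = {"P1", "P2", "P3", "P4", "P5", "P6"}

P_T = {"P1", "P3", "P6"}      # assigned by the designer to the TT domain
P_E = {"P2", "P5"}            # assigned by the designer to the ET domain
P_T_mapped = {"P1"}           # TT processes whose node is already fixed
P_E_mapped = {"P2", "P5"}     # ET processes whose node is already fixed

P_plus = P - (P_T | P_E)                  # still to be partitioned
P_T_star = P_T - P_T_mapped               # TT processes still to be mapped
P_E_star = P_E - P_E_mapped               # ET processes still to be mapped
P_star = P_T_star | P_E_star | P_plus     # everything still to be mapped
```

Every process that has to be partitioned also has to be mapped, which is why the unpartitioned set is part of the unmapped set.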

The mapping of messages is decided implicitly by the mapping
of processes. Thus, a message exchanged between two processes
on the TTC (ETC) will be mapped on the TTP bus (CAN bus) if
these processes are allocated to different nodes. If the
communication takes place between two clusters, two message instances
will be created, one mapped on the TTP bus and one on the CAN
bus. The first message is sent from the sender node to the
gateway, while the second message is sent from the gateway to the
receiving node.
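
This rule can be expressed as a small Python function that returns the buses traversed by a message, in order. The function name, the node names and the cluster table are assumptions made for the example, and the gateway itself is not modeled as a sender or receiver.

```python
BUS_OF_CLUSTER = {"TTC": "TTP", "ETC": "CAN"}


def message_buses(sender_node, receiver_node, cluster_of):
    """Return the buses a message uses, from the sender to the receiver."""
    if sender_node == receiver_node:
        return []                                  # local communication
    src = cluster_of[sender_node]
    dst = cluster_of[receiver_node]
    if src == dst:
        return [BUS_OF_CLUSTER[src]]               # intra-cluster message
    # Inter-cluster: one instance up to the gateway, one after it.
    return [BUS_OF_CLUSTER[src], BUS_OF_CLUSTER[dst]]


# Hypothetical architecture.
cluster_of = {"N1": "TTC", "N2": "TTC", "N3": "ETC"}

same_node = message_buses("N3", "N3", cluster_of)
inside_ttc = message_buses("N1", "N2", cluster_of)
ttc_to_etc = message_buses("N1", "N3", cluster_of)
etc_to_ttc = message_buses("N3", "N2", cluster_of)
```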

Example 9.1: Let us illustrate some of the issues related to
partitioning in such a context. In the example presented in
Figure 9.1 we have an application¹ with six processes, $P_1$ to
$P_6$, and four nodes: $N_1$ and $N_2$ on the TTC, $N_3$ on the ETC, and
the gateway node $N_G$. The worst-case execution times on
each node are given to the right of the application graph.
Note that $N_2$ is faster than $N_3$, and an "X" in the table means
that the process is not allowed to be mapped on that node.
The mapping of $P_1$ is fixed on $N_1$, $P_3$ and $P_6$ are mapped on
$N_2$, $P_2$ and $P_5$ are fixed on $N_3$, and we have to decide how to
partition $P_4$ between the TT and ET domains. Let us also
assume that process $P_5$ is the highest priority process on $N_3$.
In addition, $P_5$ and $P_6$ each have a deadline, $D_5$ and $D_6$,
respectively, as illustrated in the figure by thick vertical
lines.

We can observe that although $P_3$ and $P_4$ do not have
individual deadlines, their mapping and scheduling has a strong
impact on their successors, $P_5$ and $P_6$, respectively, which are
deadline constrained. Thus, we would like to map $P_4$ such
that not only $P_3$ can start on time, but $P_4$ also starts soon
enough to allow $P_6$ to meet its deadline.

As we can see from Figure 9.1a, this is impossible to
achieve by mapping $P_4$ on the TTC node $N_2$. It is interesting
to observe that, if preemption were allowed in the TT
domain, as in Figure 9.1b, both deadlines could be met. This,
however, is impossible on the TTC where preemption is not
allowed. Both deadlines can be met only if $P_4$ is mapped on
the slower ETC node $N_3$, as depicted in Figure 9.1c. In this
case, although $P_4$ competes for the processor with $P_5$, due to
the preemption of $P_4$ by the higher priority $P_5$, all deadlines
are satisfied.

1. Communications are ignored for this example.

■ 

For a multi-cluster architecture the communication infra- 
structure has an important impact on the design and, in partic- 
ular, the mapping decisions. 

Example 9.2: Let us consider the example in Figure 9.2.
We assume that P1 is mapped on node N1 and P3 on node N3
on the TTC, and we are interested to map process P2. P2 is
allowed to be mapped on the TTC node N2 or on the ETC node
N4, and its execution times are depicted in the table to the
right of the application graph.

Figure 9.1: Partitioning Example

In order to meet the deadline, one would map P2 on the
node on which it executes fastest, N2 on the TTC, see Figure 9.2a.
However, this will lead to a deadline miss due to the TTP slot
configuration, which introduces communication delays. The
application will meet the deadline only if P2 is mapped on the
slower node, i.e., node N4, as in Figure 9.2b¹. Not
only is N4 slower than N2, but mapping P2 on N4 will place P2
on a different cluster than P1 and P3, introducing extra
communication delays through the gateway node. However, due
to the actual communication configuration, the mapping
alternative in Figure 9.2b is desirable.

Figure 9.2: Mapping Example

1. Process T in Figure 9.2b executing on the gateway node NG is
responsible for transferring messages between the TTP and CAN
controllers.
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9.1.1 Exact Problem Formulation 

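As an input we have an application $\Gamma$ given as a set of process
graphs (Section 2.3.1) and a two-cluster system consisting of a
TT and an ET cluster (Section 3.5). As introduced previously,
$\mathcal{P}_T$ and $\mathcal{P}_E$ are the sets of processes already
partitioned into TT and ET, respectively. Also,
$\mathcal{P}_T^M \subseteq \mathcal{P}_T$ and
$\mathcal{P}_E^M \subseteq \mathcal{P}_E$ are the sets of
already mapped TT and ET processes.

We are interested to find a partitioning for processes in
$\mathcal{P}^+ = \mathcal{P} \setminus (\mathcal{P}_T \cup \mathcal{P}_E)$
and decide a mapping for processes in
$\mathcal{P}^* = \mathcal{P}_T^* \cup \mathcal{P}_E^* \cup \mathcal{P}^+$,
where $\mathcal{P}_T^* = \mathcal{P}_T \setminus \mathcal{P}_T^M$ and
$\mathcal{P}_E^* = \mathcal{P}_E \setminus \mathcal{P}_E^M$, such that
imposed deadlines are guaranteed to be satisfied.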



9.2 Partitioning and Mapping Strategy 

The design problem formulated in the previous section is NP- 
complete. The MultiClusterConfiguration strategy (MCC) we propose 
to solve the partitioning and mapping problem has three steps: 

1. In the first step, we decide very quickly on an initial bus
access configuration. For the TTC, it is determined by
assigning nodes to the slots in order ($S_i = N_i$) and fixing
the slot length to the minimal allowed value, which is equal
to the length of the largest message in the application. For
the ETC we calculate the message priorities $\pi$ based on the
deadlines of the receiver processes.

2. Once an initial bus access configuration has been decided, in 
the second step, we decide an initial partitioning and map- 
ping, using the algorithm described in Section 9.2.1. After 
the initial partitioning and mapping are obtained, the appli- 
cation is scheduled using the MultiClusterScheduling algorithm 
outlined in the previous chapter. 

3. If the application is schedulable the optimization strategy
stops. Otherwise, it continues with the third step by using an
iterative improvement heuristic, namely PMHeuristic, presented
in Section 9.2.2, to improve the partitioning and mapping
obtained in the second step.

After these steps the system can be further optimized by 
improving the access to the communication infrastructure. Such 
an optimization step is presented in the next chapter. 
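
Step 1 of the strategy above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the data layout, and the rule that a shorter receiver deadline yields a higher priority (1 being the highest) are hypothetical choices, since the text only says that priorities are derived from the receiver deadlines.

```python
def initial_bus_configuration(ttc_nodes, message_sizes, et_receiver_deadline):
    """Sketch of step 1: initial TTC slot sequence and ETC message priorities.

    ttc_nodes            -- TTC nodes in their given order (slot S_i goes to N_i)
    message_sizes        -- size of every message in the application
    et_receiver_deadline -- for each ETC message, the deadline of its receiver
    """
    # Every slot gets the minimal allowed length: the largest message size.
    slot_length = max(message_sizes.values())
    slots = [(node, slot_length) for node in ttc_nodes]

    # Assumption: a shorter receiver deadline means a higher priority.
    ordered = sorted(et_receiver_deadline,
                     key=lambda m: (et_receiver_deadline[m], m))
    priorities = {m: rank for rank, m in enumerate(ordered, start=1)}
    return slots, priorities


# Hypothetical input values, for illustration only.
slots, priorities = initial_bus_configuration(
    ["N1", "N2"],
    {"m1": 2, "m2": 8, "m3": 4},
    {"m2": 100, "m3": 50},
)
```

The result is deliberately cheap to compute: it only has to give the later partitioning and mapping steps a starting point.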

9.2.1 Initial Partitioning and Mapping (IPM)

Our initial mapping and partitioning algorithm (InitialPM,
Figure 9.3) receives as input the merged graph obtained by
merging all the graphs of the application $\Gamma$ (as described in
Figure 5.5 on page 111).

The IPM algorithm uses a list scheduling based greedy
approach. A process is placed in the ready list $L$ if all its
predecessors have already been scheduled. In each iteration of the
loop (lines 2-7), all ready processes from the list $L$ are
investigated, and the SelectProcess function selects for mapping
that process $P_i$ which has the largest delay $\delta_i = r_i + l_i$.
Here, $r_i$ is the response time of process $P_i$ on the
fastest node in $\mathcal{N}_{P_i}$, and $l_i$ is the critical path
starting from process $P_i$, defined as:

$$ l_i = \max_{\forall k} \sum_{\tau_j \in \pi_{ik}} r_{\tau_j} \qquad (9.1) $$

where $\pi_{ik}$ is the $k$th path from process $P_i$ to the sink node of
$\Gamma$ (not including $P_i$), and $r_{\tau_j}$ is the response time of a
process or message $\tau_j$ on $\pi_{ik}$. The response times are
calculated using the MultiClusterScheduling function, under the
following assumptions:

• Every yet unpartitioned/unmapped process $P_i \in \mathcal{P}^*$ is
considered mapped on the fastest node from the list of potential
nodes $\mathcal{N}_{P_i}$.

• The worst-case response time for messages sent or received
by yet unpartitioned/unmapped processes is considered
equal to zero.

InitialPM(Γ) -- Initial Partitioning and Mapping

1 L = {source of Γ} -- start with the first node of the merged graph
2 while L ≠ ∅ do -- visits ready processes in the order of list scheduling
3   P = SelectProcess(L)
4   N = SelectNode(P)
5   M(P) = N -- map process P on node N
6   L = UpdateReadyList(L)
7 end while
end InitialPM

Figure 9.3: The Initial Partitioning and Mapping

Example 9.3: Let us consider the design example in
Figure 9.4 where we have five processes, P1 to P5, and three
nodes: N1 on the TTC, N2 on the ETC, and the gateway node
NG. The initial bus configuration, consisting of the slot
order and size, together with the ET message priorities, is
also given. The mapping of P3 is fixed on N1, P5 is fixed on
N2, and we have to decide where to partition and map P1, P2
and P4. In the first iteration of IPM, SelectProcess has to
decide between P1 and P2, which are ready for execution. The
critical path of P1 is the longer of its two paths to the sink,
each obtained by summing the response times of the messages and
processes along the path:
$l_1 = \max(0 + 40 + 40 + 40,\ 0 + 30 + 0 + 40) = 120$,
where the first path passes through P3 and the second through P4.
Similarly, $l_2 = 0 + 30 + 0 + 40 = 70$. Thus, the delay of P1
is $\delta_1 = r_1 + l_1 = 30 + 120 = 150$, and the delay of P2 is
$\delta_2 = r_2 + l_2 = 60 + 70 = 130$¹. Therefore, SelectProcess
will select P1 because it has a larger delay.

■ 
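
The selection rule of Example 9.3 can be reproduced with a short sketch. The graph encoding and the message names mA to mE are hypothetical (the labels used in Figure 9.4 are not assumed here); the response times are the ones quoted in the example, with zero for messages of still unmapped processes.

```python
from functools import lru_cache


def critical_path(succ, resp, start):
    """Equation 9.1: longest sum of response times over all paths from
    `start` to the sink, not including `start` itself."""

    @lru_cache(maxsize=None)
    def longest_from(node):  # includes the response time of `node`
        tails = [longest_from(s) for s in succ.get(node, ())]
        return resp[node] + (max(tails) if tails else 0)

    return max((longest_from(s) for s in succ.get(start, ())), default=0)


def select_process(ready, succ, resp):
    """Pick the ready process with the largest delay r_i + l_i."""
    return max(ready, key=lambda p: resp[p] + critical_path(succ, resp, p))


# Processes and messages of the example; message names are placeholders.
succ = {
    "P1": ["mA", "mC"], "mA": ["P3"], "P3": ["mB"], "mB": ["P5"],
    "mC": ["P4"], "P2": ["mE"], "mE": ["P4"], "P4": ["mD"], "mD": ["P5"],
}
resp = {"P1": 30, "P2": 60, "P3": 40, "P4": 30, "P5": 40,
        "mA": 0, "mB": 40, "mC": 0, "mD": 0, "mE": 0}

l1 = critical_path(succ, resp, "P1")   # 120
l2 = critical_path(succ, resp, "P2")   # 70
chosen = select_process(["P1", "P2"], succ, resp)   # "P1"
```

Memoizing `longest_from` keeps the traversal linear in the size of the merged graph, which matters because the selection is repeated in every iteration of the list scheduling loop.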

Once a process $P_i$ is selected, all mapping alternatives of $P_i$ to
nodes² in $\mathcal{N}_{P_i}$ are tested by the SelectNode function.
Out of these alternatives, SelectNode returns that node $N_k$ which
leads to the smallest end-to-end delay on the application graph:

$$ \delta_{N_k} = O_i^{N_k} + r_i^{N_k} + l_i^{N_k} \qquad (9.2) $$

In the previous equation, $O_i^{N_k}$ is the offset of process $P_i$ when
mapped on node $N_k$ (i.e., the earliest possible starting time taking
into account the predecessors and the communication delay
of the incoming messages) calculated by our scheduling algorithm.

The worst-case response time $r_i^{N_k}$ is equal to the worst-case
execution time $C_i^{N_k}$ if $N_k$ is in the TTC
($N_k \in \mathcal{N}_T$). If $N_k$ is in the ETC
($N_k \in \mathcal{N}_E$), the worst-case response time is calculated
according to Equations 6.4 and 6.5.

The third term of the delay $\delta_{N_k}$ represents the critical path
$l_i^{N_k}$ from process $P_i$ to the sink node (as introduced in
Equation 9.1) in the case $P_i$ is mapped on node $N_k$. The delay
$\delta_{N_k}$ is calculated by the MultiClusterScheduling function,
under the same assumptions mentioned earlier.

Figure 9.4: Design Example

1. According to the first assumption, both P1 and P2 are considered
mapped on the fastest node.

2. $\mathcal{N}_{P_i}$ is the set of nodes on which process $P_i$ could,
potentially, be executed. If the process is already partitioned to a
certain cluster, only nodes in that cluster are considered.

Example 9.4: Let us go back to the example in Figure 9.4.
IPM has decided in the first two iterations of the while loop
(lines 2-7 in Figure 9.3) that P1 should be mapped on N1 and
P2 on N2. In the third iteration, P4 has been selected by
SelectProcess, and now the mapping alternatives on N1 and N2 are
tested by the SelectNode function. According to Equation 9.2,
if P4 is mapped on N1 we have
$\delta_{N_1} = O_4^{N_1} + r_4^{N_1} + l_4^{N_1} = 120 + 40 + (40 + 40) = 240$
(see Figure 9.4a). Similarly, for the alternative on N2 we have
$\delta_{N_2} = O_4^{N_2} + r_4^{N_2} + l_4^{N_2} = 80 + 30 + (0 + 40) = 150$
(see Figure 9.4b). Thus, P4 will be mapped on N2,
which produces the smallest delay of 150. IPM will finally
produce the schedulable solution presented in Figure 9.4b.

■ 
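
The node selection of Example 9.4 amounts to evaluating Equation 9.2 for every candidate node and keeping the minimum. The sketch below assumes that the three terms have already been produced by the scheduling analysis; the function name and the tuple layout are illustrative only.

```python
def select_node(alternatives):
    """alternatives maps a node name to (offset, response_time, critical_path),
    i.e. the three terms of Equation 9.2 for that mapping alternative."""
    delays = {node: o + r + l for node, (o, r, l) in alternatives.items()}
    best = min(delays, key=delays.get)
    return best, delays


# Terms quoted in Example 9.4 for process P4.
best_node, delays = select_node({
    "N1": (120, 40, 40 + 40),
    "N2": (80, 30, 0 + 40),
})
```

In the book's algorithm each triple comes from a call to MultiClusterScheduling, so the cost of SelectNode is dominated by the analysis, not by the comparison shown here.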

9.2.2 Partitioning and Mapping Heuristic (PMH)

If, after the initial partitioning, mapping and bus setup, we do
not obtain a schedulable application, we apply an iterative
improvement algorithm, the PMHeuristic in Figure 9.5. The
algorithm receives as input the application $\Gamma$ and the initial
partitioning and mapping $M$ produced by IPM, and produces a
partitioning and mapping for processes in $\mathcal{P}^*$.

We investigate each unschedulable graph $G_i \in \Gamma$, i.e., each graph
whose response time $r_{G_i}$ is larger than the deadline $D_{G_i}$. Our
heuristic is to perform changes to the mapping of processes in $\Gamma$
that would reduce the critical path of $G_i$, and thus the worst-case
response time $r_{G_i}$.

In each iteration, the algorithm selects that unschedulable
process graph which has the maximum delay
$\Delta_{G_i} = r_{G_i} - D_{G_i}$ between its response time and the
deadline (line 2). Let us denote the maximum delay with $\Delta_{max}$
and the corresponding graph with $G_{max}$. Next, we determine the
critical path $\mathcal{P}_{CP}$ of the process graph $G_{max}$. For
example, for the process graph in Figure 9.4 scheduled as in case (a),
the critical path is composed of P2, P4 and P5.

PMHeuristic(Γ, M) -- Partitioning and Mapping Heuristic

1 while (∃ G_i ∈ Γ ∧ r_Gi > D_Gi) and (Δ_max improved in the previous iteration) do
2   Δ_max = maximum of r_Gi - D_Gi, ∀ G_i ∈ Γ ∧ r_Gi > D_Gi
3   G_max = graph corresponding to Δ_max
4   P_CP = FindCriticalPath(G_max)
5   for each P_i ∈ P_CP do -- find changes with a potential to improve r_Gmax
6     if M(P_i) ∈ N_T then
7       List = ProposedTTChanges(P_i)
8     else -- in this case M(P_i) ∈ N_E
9       List = ProposedETChanges(P_i)
10    end if
11    for each ProposedChange ∈ List do -- determine the improvement
12      Perform(ProposedChange); MultiClusterScheduling(Γ, M)
13      Δ_max = maximum of r_Gi - D_Gi, ∀ G_i ∈ Γ ∧ r_Gi > D_Gi
14      if Δ_max smallest so far then
15        BestChange = ProposedChange
16      end if
17      Undo(ProposedChange)
18    end for
19    -- apply the move improving the most
20    if ∃ BestChange then Perform(BestChange) end if
21  end for
22 end while
23 return M
end PMHeuristic

Figure 9.5: The Partitioning and Mapping Heuristic

The intelligence of the heuristic lies in how it determines
changes (i.e., design transformations) to the mapping of
processes that can potentially shorten the critical
path (lines 7 and 9). The proposed changes in List are then
evaluated (lines 11-18) to find the change that produces the
largest reduction of $\Delta_{max}$, which is finally applied to the
system configuration (line 20). Reducing $\Delta_{max}$ means,
implicitly, reducing the response time of the process graph
investigated in the current iteration. The algorithm terminates
if all graphs in the application are schedulable, or no
improvement to $\Delta_{max}$ is found.
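
The control flow of the heuristic can be sketched as a plain iterative-improvement loop. This is a simplification, not the algorithm of Figure 9.5 itself: all proposed changes are evaluated together in each pass instead of per critical-path process, mappings are treated as immutable values so that no explicit undo is needed, and `schedule`, `propose_changes` and the deadline of 200 are hypothetical stand-ins.

```python
def max_lateness(resp, deadline):
    """Return (graph, delta_max) for the most late graph, or (None, 0)."""
    late = {g: resp[g] - deadline[g] for g in resp if resp[g] > deadline[g]}
    if not late:
        return None, 0
    worst = max(late, key=late.get)
    return worst, late[worst]


def pm_heuristic(mapping, schedule, deadline, propose_changes):
    """schedule(mapping) -> {graph: response time};
    propose_changes(mapping, graph) -> iterable of candidate mappings."""
    worst, delta = max_lateness(schedule(mapping), deadline)
    while worst is not None:
        best, best_delta = None, delta
        for candidate in propose_changes(mapping, worst):
            _, cand_delta = max_lateness(schedule(candidate), deadline)
            if cand_delta < best_delta:          # smallest delta_max so far
                best, best_delta = candidate, cand_delta
        if best is None:                         # no improvement: give up
            break
        mapping = best                           # apply the best change
        worst, delta = max_lateness(schedule(mapping), deadline)
    return mapping, delta


# Toy setting using the delays quoted for Figure 9.4 (240 on N1, 150 on N2).
schedule = lambda m: {"G1": 240 if m["P4"] == "N1" else 150}
propose = lambda m, g: [{**m, "P4": n} for n in ("N1", "N2") if n != m["P4"]]
final_mapping, final_delta = pm_heuristic(
    {"P4": "N1"}, schedule, {"G1": 200}, propose)
```

Because every accepted change strictly reduces the maximum lateness, the loop stops either with a schedulable mapping or as soon as no proposed change helps, mirroring the two termination conditions described above.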

Since a call to MultiClusterScheduling that evaluates the changes
is costly in terms of execution time, it is crucial to find a
short list of proposed changes that will potentially lead to the
largest improvement. Looking at Equation 9.2, we can observe
that the length of the critical path would be reduced if, for a
process $P_i \in \mathcal{P}_{CP}$, we would:

1. reduce the offset $O_i$ (first term of Equation 9.2);

2. decrease the worst-case response time $r_i$ (second term);

3. reduce the critical path from $P_i$ to the sink node (third term).

To reduce (1) we have to reduce the delay of the communication
from $P_i$'s predecessors to $P_i$. Thus, we consider transformations
that would change the mapping of process $P_i$ and of
predecessors of $P_i$ such that the communication delay is
minimized. However, only those predecessors that actually delay
the execution of $P_i$ are considered for remapping. Let us go
back to Figure 9.4, and consider that PMH starts from an initial
partitioning and mapping as depicted in Figure 9.4a. In this
case, to reduce the offset $O_4$ of process P4, we will consider
mapping P4 on node N2 as depicted in Figure 9.4b, thus reducing the
offset from 120 to 80.

The approach to reduce (2) depends on the type of process.
Both for TT and ET processes we can decrease the worst-case
execution time $C_i$ by selecting a faster node. For example, in
Figure 9.4, by moving P1 from N2 to N1 we reduce its worst-case
execution time from 40 to 30. However, for ET processes we can
further reduce $r_i$ by investigating the interference from other
processes on $P_i$ (Equation 6.5). Thus, we consider mapping
processes with a priority higher than $P_i$ on other nodes, thus
reducing the interference.

Point (3) is concerned with the critical path from process $P_i$ to
the sink node. In this case, we are interested to reduce the delay
of the communication from $P_i$ to its successor process on the
critical path. This is achieved by considering changes to the
mapping of $P_i$ or to the mapping of the successor process (e.g., by
including them in the same cluster, same processor, etc.). For
example, in Figure 9.4a, the critical path of P4 is enlarged by the
communication delay due to the message exchanged by P4 on the TTC
with P5 on the ETC. To reduce the length of the critical path we will
consider mapping P4 to N2, so that the communication will
take place on the same processor.

9.3 Experimental Evaluation 

For the evaluation of our algorithms we used applications of 50,
100, 150, 200, and 250 processes (all unpartitioned and
unmapped), to be implemented on two-cluster architectures
consisting of 2, 4, 6, 8, and 10 different nodes, respectively, half on
the TTC and the other half on the ETC, interconnected by a
gateway.

Thirty examples were randomly generated for each application
dimension, thus a total of 150 applications were used for
experimental evaluation. We generated both graphs with random
structure and graphs based on more regular structures like
trees and groups of chains. Execution times and message
lengths were assigned randomly using both uniform and exponential
distributions within the 10 to 100 ms and 2 to 8 bytes
ranges, respectively. The experiments were done on SUN Ultra
10 computers.

a) Percentage of schedulable applications

Figure 9.6: Comparison of the
Partitioning and Mapping Heuristics

We were interested to evaluate the proposed approaches.
Hence, we have implemented each application, on its
corresponding architecture, using the MultiClusterConfiguration (MCC)
strategy presented in Section 9.2. Figure 9.6a presents the
percentage of schedulable solutions found after each step of our
optimization strategy (N/S stands for “not schedulable”). Together with
the MCC steps, Figure 9.6a also presents a straightforward
solution (SF). The SF approach performs a partitioning and mapping
that tries to balance the utilization among nodes and buses. This
is a configuration which, in principle, could be elaborated by a
careful designer without the aid of optimization tools like the
one proposed here.

Out of the total number of applications, only 19% were
schedulable with the implementation produced by SF. However, using
our MCC strategy, we are able to obtain schedulable applications
in 76% of the cases: 30% after step two (IPM), and 76% after step
three (PMH). For all application dimensions, performing the
proposed optimization steps produces large
improvements over the straightforward configuration.
Moreover, as the applications become larger, it is more
difficult for SF to find schedulable solutions, while the
optimization steps of MCC perform very well. For 150 processes, for
example, MCC has been able to find schedulable implementations for
83% of the applications. The bottom bar, corresponding to 26%,
is the percentage of schedulable applications found by IPM. On
top of that, PMH, depicted by a black bar, adds another 47%.

Figure 9.6b presents the execution times for the IPM and PMH
steps of our multi-cluster configuration strategy, as well as for
the complete algorithm (MCC). Note that the times presented in
the figure for MCC include a complete optimization loop that
performs partitioning, mapping, bus access optimization and
scheduling. The complete optimization process implemented by
the MCC strategy takes under five hours for very large process
graphs of 250 processes, while for applications consisting of 100
processes it takes on average 2 minutes.

9.3.1 The Vehicle Cruise Controller 

Finally, we considered the cruise controller example presented 
in Section 2.3.3: 

• The model for the cruise controller is presented in Figure 2.9 
on page 40. 

• The architecture, consisting of a TTC and an ETC, each with 2 
nodes, interconnected by the CEM node, is depicted in 
Figure 2.7b on page 37. 

• The software architecture for multi-cluster systems, used by 
the CC, is presented in Section 3.5. 

• We have considered a deadline of 150 ms. 

In this setting, the SF approach failed to produce a schedulable
implementation, leading to a response time of 392 ms. After
IPM (second step of MCC), we were able to reduce the response
time of the CC to 154 ms, which is still larger than the deadline.
However, applying PMH (step three) we are able to obtain a
schedulable implementation with a response time of 146 ms.
The complete MCC executes for under two minutes for the CC.

The evaluation using synthetic applications, as well as the
real-life example, shows that by using our optimization approaches for
the partitioning and mapping problem, we are able to find
schedulable implementations under limited resources, achieving
an efficient utilization of the system.
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The previous chapters have presented analysis methods 
for communication-intensive heterogeneous real-time systems, 
taking into account the details of the communication protocols, 
in our case CAN and TTP. 

We have, however, not addressed the issue of frame packing,
which is of utmost importance in cost-sensitive embedded
systems where resources, such as communication bandwidth, have
to be fully utilized [Kop95], [San00], [Raj98]. In both TTP and
CAN protocols messages are not sent independently, but several
messages having similar timing properties are usually packed
into frames. In many application areas like, for example,
automotive electronics, messages range from one single bit (e.g., the
state of a device) to a couple of bytes (e.g., vehicle speed, etc.).
Transmitting such small messages one per frame would create a
high communication overhead, which can cause long delays
leading to an unschedulable system. For example, 65 bits have
to be transmitted on CAN for delivering one single bit of
application data. Moreover, a given frame configuration defines the
exact behavior of a node on the network, which is very important
when integrating nodes from different suppliers.

The issue of frame packing (sometimes referred to as frame
compiling) has been previously addressed separately for CAN
and TTP. In [San00], [Raj98] CAN frames are created based on
the properties of the messages, while in [Kop95] a “cluster
compiler” is used to derive the frames for a TT system which uses TTP
as the communication protocol. However, researchers have not
addressed frame packing on multi-cluster systems implemented
using both ET and TT clusters, where the interaction between the
ET and TT processes of a hard real-time application has to be
very carefully considered in order to guarantee the timing
constraints. As our multi-cluster scheduling strategy in Section 8.2
shows, the issue of frame packing cannot be addressed
separately for each type of cluster, since the inter-cluster
communication creates a circular dependency.

Therefore, in this chapter, we concentrate on the issue of pack- 
ing messages into frames, for multi-cluster distributed embed- 
ded systems consisting of time-triggered and event-triggered 
clusters, interconnected via gateways. We are interested to 
obtain that frame configuration which would produce a schedu- 
lable system. We have updated our schedulability analysis pre- 
sented in Section 8.2 to account for the frame packing, and we 
have proposed two optimization heuristics that use the schedu- 
lability analysis as a driver towards a frame configuration that 
leads to a schedulable system. 

The chapter is organized in three sections. The next section 
presents the exact formulation of the problem that we are 
addressing in this chapter. Section 10.2 updates the schedulabil- 
ity analysis for multi-clusters developed in the previous chapter, 
and uses it to drive the optimization heuristics used for frame 
generation. The last section presents the experimental results. 
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10.1 Problem Formulation 

As input to our problem we have an application $\Gamma$ given as a set
of conditional process graphs mapped on an architecture
consisting of a TTC and an ETC interconnected through a gateway.

As part of our frame packing approach, we are interested to
generate all the MEDLs on the TTC (i.e., the TT frames and the
sequence of the TDMA slots), as well as the ET frames and their
priorities on the ETC such that the global system is schedulable.

More formally, we are interested to find a mapping of
messages to frames (a frame packing configuration) denoted by a
4-tuple $\psi = \langle \alpha, \pi, \beta, \sigma \rangle$ such that the
application $\Gamma$ is schedulable.
Once a schedulable system is found, we are interested to further
improve the degree of schedulability (defined in Section 6.6.1),
so the application can potentially be implemented on a cheaper
hardware architecture (with slower buses and processors).
Determining a frame configuration $\psi$ means deciding on:

• The mapping of application messages transmitted on the ETC
to frames (the set of ETC frames $\alpha$), and their relative
priorities $\pi$. Note that the ETC frames $\alpha$ have to include
messages transmitted from an ETC node to a TTC node, messages
transmitted inside the ETC cluster, and those messages
transmitted from the TTC to the ETC.

• The mapping of messages transmitted on the TTC to frames,
denoted by the set of TTC frames $\beta$, and the sequence $\sigma$ of
slots in a TDMA round. The slot sizes are determined based on the
set $\beta$, and are calculated such that they can accommodate the
largest frame sent in that particular slot. We consider that
messages transmitted from the ETC to the TTC are not
statically allocated to frames. Rather, we will dynamically pack
messages originating from the ETC into the “gateway frame,”
for which we have to decide the data field length.

Example 10.1: Let us consider the example in Figure 10.1,
where we have the process graph G in Figure 10.1d mapped
on the two-cluster system as indicated in Figure 10.1e.

In the system configuration of Figure 10.1a we consider that,
on the TTP bus, the node N1 transmits in the first slot (S1) of
the TDMA round, while the gateway transmits in the second
slot (SG). Process P3 has a higher priority than process P2,
hence P2 will be interrupted by P3 when it receives message
m2. In such a setting, P4 will miss its deadline, which is
depicted as a thick vertical line in Figure 10.1. Changing the
frame configuration as in Figure 10.1b, so that messages m1
and m2 are packed into frame f1 and slot SG of the gateway
comes first, processes P2 and P3 will receive m1 and m2
sooner and thus reduce the worst-case response time of the
process graph, which is still larger than the deadline. In
Figure 10.1c, we also pack m3 and m4 into f2. In such a
situation, the sending of m3 will have to be delayed until m4 is
queued by P2. Nevertheless, the worst-case response time of
the application is further reduced, which means that the
deadline is met, thus the system is schedulable.

However, packing more messages will not necessarily reduce
the worst-case response times further, as it might excessively
increase the worst-case response times of messages that
have to wait for the frame to be assembled, as is the case
with message m3 in Figure 10.1c. We are interested to find a
frame packing that leads to a schedulable system.



10.2 Frame Packing Strategy 

We have updated the schedulability analysis for an ETC cluster,
presented in Section 8.2, to consider frames. We consider that the
response time of a message m is equal to the response time of the
frame f in which message m is transmitted. The response time of
a frame f is calculated similarly to the worst-case response time
for a message in Section 8.2, with the following exceptions:

• The size of the frame is calculated taking into account the
exact frame configuration for TTP and CAN (see Figures 3.3
and 3.4 on page 49 and page 50, respectively) and the size of
the messages packed into the frame.

• The jitter $J_f$ of a frame $f$ is, in the worst case, equal to the
largest worst-case response time $r_{S(m)}$ of a sender process
$P_{S(m)}$ which sends message $m$ packed into frame $f$:

$$ J_f = \max_{\forall m \in f} r_{S(m)} \qquad (10.1) $$



For the scheduling of a multi-cluster system we use the same 
algorithm as in Figure 8.2. Once we have a technique to deter- 
mine if a system is schedulable, we can concentrate on optimiz- 
ing the packing of messages to frames. 
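
As a small worked instance of the jitter rule for frames, the sketch below takes the maximum sender response time over the messages packed into one frame. The function name and the numeric values are hypothetical and serve only to illustrate the rule.

```python
def frame_jitter(frame_messages, sender_response_time):
    """Worst-case jitter of a frame: the largest worst-case response time
    among the sender processes of the messages packed into the frame."""
    return max(sender_response_time[m] for m in frame_messages)


# Hypothetical sender response times (ms) for two messages in one frame.
jitter = frame_jitter(["m1", "m2"], {"m1": 12, "m2": 30, "m3": 55})
```

The rule captures the cost of packing: the frame cannot leave before its slowest sender has produced its message, so every other message in the frame inherits that delay.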

Such an optimization problem is NP-complete [San00], thus
obtaining the optimal solution is not feasible. We propose two
frame packing optimization strategies, one based on a simulated
annealing approach, while the other is based on a greedy
heuristic that uses problem-specific knowledge intelligently in
order to explore the design space.

In order to drive our optimization algorithms towards schedu- 
lable solutions, we characterize a given frame packing configu- 
ration using the degree of schedulability of the application, as 
presented in Section 6.6.1. 



10.2.1 Frame Packing with Simulated Annealing

The first algorithm we have developed is based on a simulated
annealing (SA) strategy, see Appendix A. As discussed before, an
essential component of an SA algorithm is the generation of a
new solution $x'$ starting from the current one (Figure A.1 in
Appendix A). The neighbors of the current solution are
obtained by performing transformations (called moves) on the
current frame configuration $\psi$. We consider the following moves:

• moving a message m from a frame f1 to another frame f2 (or
moving m into a separate single-message frame);

• swapping the priorities of two frames in $\alpha$;

• swapping two slots in the sequence $\sigma$ of slots in a TDMA
round.
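
The three moves listed above can be sketched as neighbor-generating functions. The representation is an assumption made for illustration: a frame set is a list of message lists, and both the frame priority order and the slot sequence are plain lists, so the two swap moves share one helper.

```python
import random


def move_message(frames, rng):
    """Move one randomly chosen message to another frame, or to a new
    single-message frame; empty frames are dropped afterwards."""
    frames = [list(f) for f in frames]          # work on a copy
    src = rng.choice([i for i, f in enumerate(frames) if f])
    msg = frames[src].pop(rng.randrange(len(frames[src])))
    targets = [i for i in range(len(frames)) if i != src] + [None]
    dst = rng.choice(targets)
    if dst is None:
        frames.append([msg])                    # separate single-message frame
    else:
        frames[dst].append(msg)
    return [f for f in frames if f]


def swap_two(sequence, rng):
    """Swap two entries: used for frame priorities and for TDMA slots."""
    sequence = list(sequence)
    i, j = rng.sample(range(len(sequence)), 2)
    sequence[i], sequence[j] = sequence[j], sequence[i]
    return sequence


rng = random.Random(1)
neighbor_frames = move_message([["m1", "m2"], ["m3"]], rng)
neighbor_slots = swap_two(["S1", "SG", "S2"], rng)
```

Note that moving the only message of a frame into a new single-message frame yields an equivalent configuration; an SA loop simply evaluates such a neighbor as having the same cost.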

10.2.2 Frame Packing Greedy Heuristic

The OptimizeFramePacking greedy heuristic (Figure 10.2)
constructs the solution by progressively selecting the best candidate
in terms of the degree of schedulability.

We start by observing that all activities taking place in a
multi-cluster system are ordered in time using the offset
information, determined in the StaticScheduling function (see
Section 8.2) based on the response times known so far and the
application structure (i.e., the dependencies in the process
graphs). Thus, our greedy heuristic, outlined in Figure 10.2,
starts by building two lists of messages ordered according to
the ascending value of their offsets, one for the TTC,
$messages_\beta$, and the other for the ETC, $messages_\alpha$. Our
heuristic is to consider for packing in the same frame messages
which are adjacent in the ordered lists.

Example 10.2: Let us consider that we have
three messages, m1 of 1 byte, m2 of 2 bytes and m3 of 3 bytes,
and that messages are ordered as m3, m1, m2 based on the
offset information. Also, assume that our heuristic has
suggested two frames, frame f1 with a data field of 4 bytes, and f2
with a data field of 2 bytes.

The PackMessages function will start with m3 and pack it in
frame f1. It continues with m1, which is also packed into f1,
since there is space left for it. Finally, m2 is packed in f2,
since there is no space left for it in f1.
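
A sequential packing routine consistent with this example can be sketched as follows. It is an assumption that PackMessages fills the frames strictly in order and never returns to an earlier frame; the function signature is hypothetical.

```python
def pack_messages(ordered_messages, sizes, frame_capacities):
    """Fill frames in order: a message goes into the current frame if it
    still fits, otherwise packing moves on to the next frame."""
    frames = [[] for _ in frame_capacities]
    free = list(frame_capacities)
    current = 0
    for m in ordered_messages:
        while current < len(frames) and sizes[m] > free[current]:
            current += 1
        if current == len(frames):
            raise ValueError(f"no frame left for message {m}")
        frames[current].append(m)
        free[current] -= sizes[m]
    return frames


# Messages ordered by offset as m3, m1, m2; frames of 4 and 2 bytes.
packed = pack_messages(
    ["m3", "m1", "m2"],
    {"m1": 1, "m2": 2, "m3": 3},
    [4, 2],
)
```

With the ordering m3, m1, m2, frame f1 receives m3 and m1 (4 bytes in total) and f2 receives m2. Keeping the fill order tied to the offset order is what guarantees that only adjacent messages share a frame.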
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OptimizeFramePacking(r) 

1 -- given an application Γ, find out if it is schedulable and produce the
2 -- configuration ψ = ⟨α, π, β, σ⟩ leading to the smallest δ_Γ
3 -- build the message lists ordered ascending on their offsets
4 messages_β = ordered list of n_β messages on the TTC
5 messages_α = ordered list of n_α messages on the ETC
6 -- build an initial frame configuration ψ = ⟨α, π, β, σ⟩
7 β = messages_β; α = messages_α -- initially, each frame carries one message
8 -- determine an initial TDMA slot sequence σ
9 for each slot S_i ∈ σ do S_i = N_i; size_Si = size_largest_message end for
10 π_initial = HOPA -- calculate the priorities π according to the HOPA heuristic
11 -- find the best allocation of slots, the TDMA slot sequence σ_current
12 for each slot S_i ∈ σ_current do
13 for each node N_j ∈ TTC do
14 σ_current.S_i = σ_current.S_j -- allocate N_j tentatively to S_i, N_i gets slot S_j
15 -- determine the best frame packing configuration β for the TTC
16 for each β_current having a number of 1 to n_β frames do
17 for each frame f_i ∈ β_current do
18 -- determine the best frame size for f_i
19 for each frame size S_β ∈ RecomendedSizes(messages_β) do
20 β_current.f_i.S = S_β
21 -- determine the best frame packing configuration α for the ETC
22 for each α_current having a number of 1 to n_α frames do
23 for each frame f_j ∈ α_current do
24 -- determine the best frame size for f_j
25 for each frame size S_α ∈ RecomendedSizes(messages_α) do
26 α_current.f_j.S = S_α; ψ_current = ⟨α_current, π_initial, β_current, σ_current⟩
27 PackMessages(ψ_current, messages_β ∪ messages_α)
28 δ_Γ = MultiClusterScheduling(Γ, M, ψ_current)
29 -- remember the best configuration so far
30 if δ_Γ(ψ_current) is best so far then ψ_best = ψ_current end if
31 end for
32 end for
33 if ψ_best exists
34 then α_current.f_j.S = size of frame f_j in the configuration ψ_best end if
35 end for
36 if ψ_best exists then α_current = frame set α in the configuration ψ_best end if
37 end for
38 end for
39 if ψ_best exists then β_current.f_i.S = size of frame f_i in the configuration ψ_best end if
40 end for; if ψ_best exists then β_current = frame set β in the configuration ψ_best end if
41 end for
42 if ψ_best exists
43 then σ_current.S_i = node in the slot sequence σ in the configuration ψ_best end if
44 end for
45 return SchedulabilityTest(Γ, ψ_best), ψ_best
end OptimizeFramePacking

Figure 10.2: The OptimizeFramePacking Algorithm
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The OptimizeFramePacking tries to determine, using the for-each 
loops in Figure 10.2, the allocation of frames, i.e., the number of 
frames and their sizes, for each cluster. The actual mapping of 
messages to frames will be performed by the PackMessages func- 
tion as described previously. 

As an initial TDMA slot sequence $\sigma_{initial}$ on the TTC,
OptimizeFramePacking assigns nodes to the slots and fixes the slot
length to the minimal allowed value, which is equal to the length
of the largest message generated by a process assigned to $N_i$,
$size_{S_i} = size_{largest\_message}$ (line 9 in Figure 10.2).

Then, the algorithm looks, in the innermost for-each loops, for
the optimal frame configuration α (lines 21-35). This means
deciding on how many frames to include in α (line 22), and which
are the best sizes for them (lines 24-31). In α there can be any
number of frames, from one single frame to n_α frames (in which
case each frame carries one single message). Hence, several
numbers of frames are tried, each tested for a recommended size
S_f, to see if it improves the current configuration. The
RecomendedSizes(messages_α) list is built recognizing that only
messages adjacent in the messages_α list will be packed into the
same frame. Frame sizes are obtained by summing the sizes of
combinations of adjacent messages, without exceeding 8 bytes.

Example 10.3: For the previous example, with m_1, m_2 and
m_3, of 1, 2 and 3 bytes, respectively, the recommended frame
sizes will be 1, 2, 3, 5, and 6 bytes. A size of 4 bytes will
not be recommended since there are no adjacent messages
that can be summed together to obtain 4 bytes of data.

■ 
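The construction of the recommended sizes can be illustrated with a
short Python sketch. It is only one possible reading of the
description above: the function name recommended_sizes and the
max_frame_size parameter are hypothetical names chosen for this
illustration, and message sizes are assumed to be positive integers
given in bytes, in the order of the message list.

```python
def recommended_sizes(message_sizes, max_frame_size=8):
    """Return the sorted sums of runs of adjacent messages that fit a frame.

    message_sizes: sizes in bytes, in message-list order (assumed positive).
    max_frame_size: largest allowed frame payload in bytes (8 in the text).
    """
    sizes = set()
    for start in range(len(message_sizes)):
        total = 0
        # Extend the run of adjacent messages one message at a time.
        for size in message_sizes[start:]:
            total += size
            if total > max_frame_size:
                break  # longer runs from this start can only be larger
            sizes.add(total)
    return sorted(sizes)
```

For the three messages of Example 10.3, recommended_sizes([1, 2, 3])
yields [1, 2, 3, 5, 6]; 4 is absent because no run of adjacent
messages sums to it.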

Once a configuration α_best for the ETC, minimizing δ_Γ, has been
determined (considering for π, β, σ the initial values determined
at the beginning of the algorithm), the algorithm looks for the
frame configuration β which will further improve δ_Γ (the loop
consisting of lines 15 to 41). The degree of schedulability δ_Γ (the
smaller the value, the more schedulable the system) is
calculated based on the response times produced by the
MultiClusterScheduling algorithm (see Section 8.2) in line 28. After
a β_best has been decided, the algorithm looks for a slot sequence
σ, starting with the first slot, and tries to find the node which,
when transmitting in this slot, will reduce δ_Γ (loop 11-44). The
algorithm continues in this fashion, recording the best ever ψ_best
configurations obtained in terms of δ_Γ, and thus, the best
solution ever is reported when the algorithm finishes. In the
inner loops of the heuristic we will not change the frame
priorities π_initial set at the beginning of the algorithm, on line 10.
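The "improve one decision at a time and remember the best
configuration" idea can be sketched in Python as follows. This is a
deliberately simplified illustration, not the OptimizeFramePacking
algorithm itself: greedy_search, the dictionary representation of a
configuration, and the cost callback (standing in for the degree of
schedulability, where smaller is better) are all hypothetical names
introduced here, and the nesting of the loops in Figure 10.2 is
flattened into a sequence of decisions.

```python
def greedy_search(initial, candidates, cost):
    """Greedy, one-decision-at-a-time search over a configuration.

    initial:    dict mapping decision name -> initial value
    candidates: dict mapping decision name -> iterable of values to try,
                listed in the order the decisions should be improved
    cost:       function(config dict) -> number, smaller is better
    """
    best = dict(initial)
    best_cost = cost(best)
    for name, values in candidates.items():
        for value in values:
            trial = dict(best)      # keep all earlier decisions fixed
            trial[name] = value
            trial_cost = cost(trial)
            if trial_cost < best_cost:
                # remember the best configuration so far
                best, best_cost = trial, trial_cost
    return best, best_cost
```

In such a scheme a decision is only changed when the cost strictly
improves, so the configuration returned is never worse than the
initial one.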

10.3 Experimental Evaluation 

For the evaluation of our algorithms we first used process
applications generated for experimental purposes. Similar to the
experimental setup in Chapter 8, we considered two-cluster
architectures consisting of 2, 4, 6, 8 and 10 nodes, half on the TTC
and the other half on the ETC, interconnected by a gateway.
Forty processes were assigned to each node, resulting in
applications of 80, 160, 240, 320 and 400 processes. Message sizes
were randomly chosen between 1 bit and 2 bytes. Thirty examples
were generated for each application dimension, thus a total of
150 applications were used for experimental evaluation.
Worst-case execution times and message lengths were assigned
randomly using both uniform and exponential distributions. For the
communication channels we considered a transmission speed of
256 Kbps and a length below 20 meters. All experiments were
run on a SUN Ultra 10.

The first result concerns the ability of our heuristics to
produce schedulable solutions. We have compared the degree of
schedulability δ_Γ obtained from our OptimizeFramePacking (OFP)
heuristic (Figure 10.2) with the near-optimal values obtained by
the simulated annealing algorithm SA. Obtaining solutions that
have a higher degree of schedulability means obtaining tighter
response times, increasing the chances of meeting the deadlines.

Figure 10.3: Evaluation of the Frame Packing Heuristics

Figure 10.3a presents the average percentage deviation of the
degree of schedulability produced by OFP from the near-optimal
values obtained with SA. Together with OFP, a straightforward
approach (SF) is presented. The SF approach does not consider
frame packing, and thus each message is transmitted
independently in a frame. Moreover, for SF we considered a TTC bus
configuration consisting of a straightforward ascending order of
allocation of the nodes to the TDMA slots; the slot lengths were
selected to accommodate the largest message frame sent by the
respective node, and the scheduling has been performed by the
MultiClusterScheduling algorithm in Figure 8.2.
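This section does not spell out how the average percentage deviation
is computed. The Python sketch below shows one common way to compute
such a metric, under the assumption that each heuristic result is
compared with the near-optimal result for the same application and
that the deviation is taken relative to the magnitude of the
near-optimal value. The function name and argument names are
hypothetical.

```python
def average_percentage_deviation(heuristic_costs, near_optimal_costs):
    """Average of per-application percentage deviations (assumed formula).

    Each deviation is 100 * (heuristic - reference) / |reference|, so the
    reference values must be non-zero. Smaller cost is assumed to be better.
    """
    if not heuristic_costs or len(heuristic_costs) != len(near_optimal_costs):
        raise ValueError("need two non-empty lists of equal length")
    deviations = [
        100.0 * (h - ref) / abs(ref)
        for h, ref in zip(heuristic_costs, near_optimal_costs)
    ]
    return sum(deviations) / len(deviations)
```

The absolute value in the denominator is a choice made for this
sketch, so that the sign of the deviation stays meaningful if the
cost values are negative.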

Figure 10.3a shows that when packing messages to frames, 
the degree of schedulability improves dramatically compared to 
the straightforward approach. The greedy heuristic 
OptimizeFramePacking performs well for all the graph dimensions, 
having run-times which are more than two orders of magnitude 
smaller than with SA. 

When deciding on which heuristic to use for design space
exploration or system synthesis, an important issue is the
execution time. On average, our optimization heuristics needed a
couple of minutes to produce results, while the simulated annealing
approach had an execution time of up to 6 hours (see
Figure 10.3b).

10.3.1 The Vehicle Cruise Controller

Finally, we considered the cruise controller example presented 
in Section 2.3.3: 

• The model for the cruise controller is presented in Figure 2.9 
on page 40. 

• The architecture, consisting of a TTC and an ETC, each with 2 
nodes, interconnected by the CEM node, is depicted in 
Figure 2.7b on page 37. 
• The software architecture for multi-cluster systems, used by
the CC, is presented in Section 3.5.

• We considered one mode of operation with a deadline of 250 
ms. 

In this context, the straightforward approach SF produced an
end-to-end response time of 320 ms, greater than the deadline,
while both the OFP and SA heuristics produced a schedulable
system with a worst-case response time of 172 ms.

This shows that the optimization heuristic proposed, driven
by our schedulability analysis, is able to identify a frame
packing configuration that increases the schedulability degree
of an application, allowing the developers to reduce the
implementation cost of a system.
Simulated Annealing is an optimization heuristic that
tries to find the global optimum by randomly selecting a new
solution from the neighbors of the current solution [Ree93].

The approach derives its name from the process of
crystallization of materials. If a material is heated past its
melting point and then cooled, the rate of cooling can influence its
structural properties: cooling too fast introduces
imperfections. This process is called annealing, hence the name
simulated annealing (SA). Researchers have suggested that this type
of simulation can be used to solve optimization problems.

The SA algorithm is a variant of the neighborhood search
technique, where the local search space is explored by moving from
the current solution to a neighbor solution. In general, the new
solution is accepted if it is an improved one. However, in the case
of SA, a worse solution can also be accepted with a certain
probability that depends on the deterioration of the cost function
and on a control parameter called temperature, which is analogous
to the temperature concept of the physical annealing process.
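The acceptance rule can be made concrete with a small Python sketch.
It assumes the standard Metropolis criterion, in which a solution
that is worse by delta is accepted with probability
exp(-delta / temperature); the function names are hypothetical and
chosen only for this illustration.

```python
import math
import random


def acceptance_probability(delta, temperature):
    """Probability of accepting a move that changes the cost by delta.

    Improvements (delta < 0) are always accepted; a deterioration is
    accepted with probability exp(-delta / temperature) (assumed
    Metropolis criterion).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if delta < 0:
        return 1.0
    return math.exp(-delta / temperature)


def accept(delta, temperature, rng=random):
    """Draw q uniformly from [0, 1) and accept if q is below the probability."""
    return rng.random() < acceptance_probability(delta, temperature)
```

With this rule the same deterioration is accepted more readily at a
high temperature than at a low one, which is what lets the search
escape local minima early on and settle down as the temperature
drops.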

In Figure A.1 we give a short description of this algorithm.
The algorithm starts with constructing an initial solution. How
this initial solution is constructed depends on the particular
problem that has to be solved. In general, it is sufficient to
generate an arbitrary valid solution.

SimulatedAnnealing
 1  construct an initial solution x^now
 2  temperature = initial temperature TI
 3
 4  repeat
 5    for i = 1 to temperature length TL do
 6      generate randomly a neighboring solution x' of x^now
 7      delta = CostFunction(x') - CostFunction(x^now)
 8      if delta < 0 then x^now = x'
 9      else
10        generate q = random(0, 1)
11        if q < e^(-delta/temperature) then x^now = x' end if
12      end if
13    end for
14    temperature = ε * temperature
15  until stopping criterion is met
16
17  return solution corresponding to the best CostFunction
end SimulatedAnnealing

Figure A.1: The Simulated Annealing Strategy

An essential component of the algorithm is the generation of a
new solution x' starting from the current one x^now (line 6 in the
algorithm). The generation of the neighbor solution x' depends
on the details of the optimization problem and the internal
representation of a solution. The neighbor solutions x' are
generated by performing design transformations on x^now. The
design transformations applied depend on which parts of a
system we are interested in synthesizing.

For the implementation of the simulated annealing algorithm,
the parameters TI (initial temperature), TL (temperature
length), ε (cooling ratio), and the stopping criterion have to be
determined. They define the so-called cooling schedule and have
a decisive impact on the quality of the solutions and the CPU
time consumed. The temperature length TL determines how
many iterations the loop comprised of lines 5-13 will perform at
a certain temperature, and the cooling ratio ε decides how
fast the temperature will drop, in each iteration of the 4-15
repeat loop, starting from the initial temperature TI.
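The role of the cooling schedule can be seen in a minimal Python
sketch of the strategy. It is an illustration under stated
assumptions rather than the implementation used for the experiments:
the stopping criterion (temperature falling below t_min), the
parameter names ti, tl, epsilon and t_min, the seed argument, and the
neighbor and cost callbacks are all hypothetical choices made here.

```python
import math
import random


def simulated_annealing(initial, neighbor, cost, ti, tl, epsilon, t_min, seed=0):
    """Simulated annealing with a geometric cooling schedule.

    neighbor(solution, rng) -> a random neighboring solution
    cost(solution)          -> number, smaller is better
    ti, tl, epsilon         -> initial temperature, temperature length,
                               cooling ratio (0 < epsilon < 1)
    t_min                   -> assumed stopping criterion: stop once the
                               temperature drops to t_min or below
    """
    if not (0 < epsilon < 1) or ti <= 0 or t_min <= 0:
        raise ValueError("need 0 < epsilon < 1, ti > 0 and t_min > 0")
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    temperature = ti
    while temperature > t_min:
        for _ in range(tl):  # TL iterations at each temperature
            candidate = neighbor(current, rng)
            delta = cost(candidate) - current_cost
            # accept improvements, and worse moves with prob. exp(-delta/T)
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temperature *= epsilon  # cool down
    return best, best_cost
```

A larger tl or an epsilon closer to 1 makes the search slower but
more thorough, which is the trade-off between solution quality and
CPU time that the parameter tuning has to balance.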

In our experiments, we were interested in obtaining values for TI,
TL and ε that guarantee the finding of near-optimal
solutions in an acceptable time. In order to tune the optimization
parameters TI, TL, and ε we first performed very long and
expensive runs on selected large examples, and the best ever
solution, for each example, has been considered as the
near-optimum. Based on further experiments we have determined the
parameters of the SA algorithm, for different sizes of examples,
so that the optimization time is reduced as much as possible but
the near-optimal result is still produced. These parameters
(tuned to different values for each experimental setup) have
then been used for the large-scale experiments presented in the
experimental sections of each chapter.
List of Notations

Application

Γ_i          An application composed of several conditional
             process graphs

δ_Γ          Degree of schedulability of application Γ

R_Γ          Modification cost of application Γ

Ω            Subset of applications

R(Ω)         Modification cost of the applications in subset Ω

             Dependency graph of applications, consisting of a set
             of edges and a set of nodes

             An edge in the dependency graph, denoting that one
             application depends on another application

S_t          Set of possible worst-case execution times
             characterizing the execution time of processes
             belonging to an application

             The occurrence probability of a worst-case
             execution time t ∈ S_t

S_U          Set of possible worst-case utilizations
             characterizing the processes of an application

             The occurrence probability of a worst-case
             utilization U ∈ S_U

S_b          Set of possible message sizes for messages in an
             application

             The occurrence probability of a message size
             b ∈ S_b

T_min        Smallest expected period characterizing an
             application

t_need       Expected necessary processor time for an
             application inside a period

b_need       Expected necessary bus bandwidth for an
             application inside a period


Conditional Process Graph

𝒢_i          Conditional process graph 𝒢_i(V, E_S, E_C)

V            Set of nodes in the conditional process graph

E_S          Set of simple edges in the conditional process graph

E_C          Set of conditional edges in the conditional
             process graph

E            The set of all edges in the conditional process
             graph, E_S ∪ E_C = E

e_ij         Edge in the E set, from P_i to P_j, indicating that
             the output of P_i is the input of P_j

G_i          Mapped conditional process graph

T_Gi         Period of the mapped conditional process graph G_i

D_Gi         End-to-end deadline on the mapped conditional
             process graph G_i

Γ_i          Process graph without conditions

g_i          A trace through a conditional process graph for a
             given combination of conditions
Process

P_i          Process

M(P_i)       Resource executing process P_i

𝒩_Pi         The set of nodes where P_i could potentially be
             mapped

X_Pi         Guard associated to process P_i

C_i          Worst-case execution time of process P_i when
             executing on the resource M(P_i)

U_i          Utilization due to process P_i

T_i          Period of process P_i

priority_Pi  Priority of process P_i

D_i          Deadline of process P_i

r_i          Worst-case response time of process P_i

J_i          Jitter of process P_i

hp(P_i)      Set of processes having a higher priority than
             priority_Pi

lp(P_i)      Set of processes having a lower priority than
             priority_Pi

I_i          The interference on the execution of process P_i
             due to processes having a higher priority than
             priority_Pi

w_i          The width of the level-i busy period

B_i          The blocking time experienced by process P_i due
             to processes having a lower priority than
             priority_Pi

ASAP(P_i)    The as-soon-as-possible start time of process P_i

ALAP(P_i)    The as-late-as-possible start time of process P_i

q            Number of level-i busy periods
Message

m            Message

S(m)         The sender process of message m

D(m)         The destination process of message m

s_m          Size of message m

C_m          Worst-case transmission time of message m

T_m          Period of message m

priority_m   Priority of message m

r_m          Worst-case response time of message m

J_m          Jitter of message m

hp(m)        Set of messages having a higher priority than
             priority_m

lp(m)        Set of messages having a lower priority than
             priority_m

I_m          The interference/the worst-case queuing time of
             message m due to messages having a higher
             priority than priority_m

w_m          The width of the level-m busy period

B_m          The interference on the worst-case transmission
             time experienced by message m due to messages
             having a lower priority than priority_m

             Number of packets of message m

f            Frame

J_f          Jitter of frame f

p            Packet
System Configuration

ψ            A system configuration

𝒩_T          The TT cluster set of nodes

𝒩_E          The ET cluster set of nodes

N_i          A node in the hardware architecture

N_G          The gateway node

PE_i         Processing element

T_cycle      Time length of a TDMA cycle

T_TDMA       Time length of a TDMA round

S_i          The slot of a TDMA round

s_S          Size of the data field of the largest frame that
             can be sent in slot S of a TDMA round

θ_m          Maximum time between two consecutive slots of
             a TDMA cycle carrying message m

s            Transmission speed of a bus

φ            Set of offsets

π            Set of priorities

             Set of processes

α            Set of frames on a CAN bus

β            TDMA round configuration; set of frames on a TTP
             bus determining the TDMA configuration

σ            Sequence and size of slots in a TDMA round
             configuration

Out_i, Out_Ni  Queue with messages awaiting transmission on
             the hardware node N_i

Out_TTP      Queue with messages awaiting transmission on
             the TTP bus from the gateway node of a
             multi-cluster system

Out_CAN      Queue with messages awaiting transmission on
             the CAN bus from the gateway node of a
             multi-cluster system

             Size of an outgoing queue

s_total      Total size of all outgoing queues

C            Cost function, design metric
List of Abbreviations

ABS      Anti-lock Braking System
ACK      Acknowledgment
ALAP     As Late As Possible
ASAP     As Soon As Possible
ASIC     Application Specific Integrated Circuit
ASIP     Application Specific Instruction Processor
CAD      Computer Aided Design
CAN      Controller Area Network
CC       Cruise Controller
CEM      Central Electronic Module
CPG      Conditional Process Graph
CPU      Central Processing Unit
CRC      Cyclic Redundancy Check
DSP      Digital Signal Processor
ECM      Engine Control Module
EOF      End Of Frame
ET       Event Triggered
ETC      Event Triggered Cluster
ETM      Electronic Throttle Module
FIFO     First In First Out
FPGA     Field Programmable Gate Array
FPS      Fixed Priority Scheduling
IFD      Inter Frame Delimiter
MBI      Message Base Interface
MEDL     Message Descriptor List
MHTT     Message Handling Time Table
MPCP     Modified Partial Critical Path
PCP      Partial Critical Path
RAM      Random Access Memory
ROM      Read Only Memory
SA       Simulated Annealing
SOF      Start Of Frame
TCM      Transmission Control Module
TDMA     Time Division Multiple Access
TT       Time Triggered
TTC      Time Triggered Cluster
TTP      Time Triggered Protocol
VLSI     Very Large Scale Integration
WCAO     Worst Case Administrative Overhead
WCET     Worst Case Execution Time